Monday, 29 April 2013

Keeping it Agile: the secret to a fitter ontology in 4* easy** steps!


*there are probably more than 4
**it's not all that easy

I've been a bit preachy recently, complaining that the ontology world doesn't apply enough software engineering practice when producing ontologies. I thought it was about time I explained some of the things I think could be done, by talking specifically about the things we do here to help us. There's an expanded version of this in a paper accepted for the 2013 OWLED workshop, for those attending.


1. Whatcha gonna do?


The first thing we steal from software engineering is our overall methodology. I talked a bit about this previously at ICBO 2012, where I presented on how we applied Agile Software Engineering Methods to the development of the Software Ontology. There are a few things this gives us. It helps us prioritise development. Collecting requirements is not usually a problem - there are always bucketloads. As with most projects, there is always more work than people, and we need to focus on the things that are most important - which can change month to month.
The red stuff means we're doing it right (that is,
we're catching the stuff we're doing wrong early).

We use a few agile methods to help with this. Priority poker and buy-a-feature have been of particular use when engaging with users, and are also reasonably fun to do. It also helps keep our major stakeholders involved with the process of development, which is useful because it means there are no big surprises at the end of each sprint (i.e. cycle of development). This way everyone knows what we're gonna do, and so do we.

2. Building a house with bricks on your back


One of the primary ontologies I'm currently involved in developing is the Experimental Factor Ontology. EFO is an application ontology - that is to say, it is built to serve application use cases, as distinct from a reference ontology, which is built as a de facto reference for a given domain. When building EFO we try to reuse as many reference ontologies as we deem suitable (I won't expand on what this means here). Needless to say, our reliance on external resources introduces coupling - in the same way it does in software projects using libraries. I often refer to this as trying to build a house with the bricks strapped to your back; nice to know you have them close by, but they're heavy. We have some importing code, based on MIREOT, which we use to help us manage these imports. This still leaves issues to look out for. For example, there is much variation in annotation property names - for 'synonyms', say - so we need to merge these so our applications know where to find them. Where imports are not possible or suitable, we mint new EFO classes. Since multiple developers from various sites can mint these, we have built tooling for central URI management to avoid the clashes which could otherwise easily occur. URIGen is this tool - see my previous blog for more on this.
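To make the synonym-merging step concrete, here is a minimal sketch of the idea using the OWL-API. This is not our actual import code: the source property IRIs are just two common examples, the target property IRI is illustrative, and the file name is hypothetical.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SynonymMerger {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Hypothetical file containing EFO plus its imported classes
        OWLOntology ont = manager.loadOntologyFromOntologyDocument(new File("efo-with-imports.owl"));
        OWLDataFactory factory = manager.getOWLDataFactory();

        // Synonym-style annotation properties used by imported ontologies (illustrative, not exhaustive)
        Set<IRI> synonymLike = new HashSet<>(Arrays.asList(
                IRI.create("http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"),
                IRI.create("http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym")));

        // The single property our applications expect synonyms under (illustrative IRI)
        OWLAnnotationProperty target = factory.getOWLAnnotationProperty(
                IRI.create("http://www.ebi.ac.uk/efo/alternative_term"));

        for (OWLClass cls : ont.getClassesInSignature()) {
            for (OWLAnnotationAssertionAxiom ax : ont.getAnnotationAssertionAxioms(cls.getIRI())) {
                if (synonymLike.contains(ax.getProperty().getIRI())) {
                    // Re-assert the same value under the property our tools know about
                    manager.addAxiom(ont, factory.getOWLAnnotationAssertionAxiom(
                            target, cls.getIRI(), ax.getValue()));
                }
            }
        }
        manager.saveOntology(ont);
    }
}
```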

To keep track of external changes and to produce our release notes we use our Bubastis tool, which does a simple syntactic diff across two ontologies to tell you what has changed, been added and been deleted. Keeping track of what's going on externally is a complicated process and brings baggage with it. There is a discussion to be had as to when keeping track introduces an unacceptable overhead, as you are effectively at the mercy of external developers. Examples of changes we've had to deal with include: upper ontology refactoring, mass URI refactoring, funding ending, general movement of classes, changes to design patterns (and the axiomatisation therein) and so on. For what it's worth, I think we're in a better place now than when we started building EFO five and a bit years ago, although my opinion on this will change if the new (non-backwards-compatible) BFO temporal relations are adopted.
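For the curious, the core of a syntactic diff like this is straightforward to sketch with the OWL-API. This is not the Bubastis code itself, just an illustration of the idea; the file names are hypothetical.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;

import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class SimpleOntologyDiff {
    public static void main(String[] args) throws Exception {
        OWLOntology oldVersion = OWLManager.createOWLOntologyManager()
                .loadOntologyFromOntologyDocument(new File("efo-previous.owl")); // hypothetical file
        OWLOntology newVersion = OWLManager.createOWLOntologyManager()
                .loadOntologyFromOntologyDocument(new File("efo-current.owl"));  // hypothetical file

        Set<OWLClass> oldClasses = oldVersion.getClassesInSignature();
        Set<OWLClass> newClasses = newVersion.getClassesInSignature();

        // Named classes that only appear in one of the two versions
        Set<OWLClass> added = new HashSet<>(newClasses);
        added.removeAll(oldClasses);
        Set<OWLClass> deleted = new HashSet<>(oldClasses);
        deleted.removeAll(newClasses);

        // Classes in both versions whose class axioms differ (a simple set difference)
        Set<OWLClass> changed = new HashSet<>();
        for (OWLClass cls : newClasses) {
            if (oldClasses.contains(cls)
                    && !newVersion.getAxioms(cls).equals(oldVersion.getAxioms(cls))) {
                changed.add(cls);
            }
        }

        System.out.printf("added: %d, deleted: %d, changed: %d%n",
                added.size(), deleted.size(), changed.size());
    }
}
```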

3. Test Driven Development


Another agile process we adopt is test-driven development. In a continuous integration framework it is necessary to test each commit of code to ensure that it does not break previously working components or introduce new bugs, and we treat OWL with the same respect. We have developed a series of automated tests, run against the ontology on each commit using Bamboo, which perform checks for:

  • invalid namespaces;
  • IRI fragments outside accepted conventions;
  • duplicate labels between different classes;
  • synonyms duplicated between classes;
  • obsolete classes used in axiomatisation;
  • expected class subsumptions (e.g. cancer should be a subclass of disease).
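The last of those checks is the easiest to illustrate. Below is a minimal sketch of what such a subsumption unit test might look like using the OWL-API, HermiT and JUnit; the file path is hypothetical and the EFO IRIs are given purely for illustration.

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import java.io.File;

public class SubsumptionTest {

    @Test
    public void cancerShouldBeASubclassOfDisease() throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology efo = manager.loadOntologyFromOntologyDocument(new File("efo.owl")); // hypothetical path
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(efo);

        OWLClass cancer = factory.getOWLClass(IRI.create("http://www.ebi.ac.uk/efo/EFO_0000311"));  // illustrative IRI
        OWLClass disease = factory.getOWLClass(IRI.create("http://www.ebi.ac.uk/efo/EFO_0000408")); // illustrative IRI

        // The inferred (not just asserted) superclasses of cancer should include disease
        assertTrue(reasoner.getSuperClasses(cancer, false).containsEntity(disease));
    }
}
```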

4. Design Patterns


Another aspect is performance and the OWL DL profile we restrict ourselves to. In order to fully exploit the querying power of the ontology, we use reasoning to infer various hierarchies of interest, such as classifications of cell lines by disease and species, and we need this to happen in a time that is responsive. There are several methods we use to ensure this remains the case. The first is the use of design patterns: we restrict axiomatisation to a set of patterns that we have developed to answer our priority competency questions. The second is to disallow the addition of new object properties and characteristics on those properties. The third is to classify the ontology on every commit (and run the test code above). For those interested in reasoners, HermiT gives us the best performance for this, and has done for quite some time now.
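A per-commit classification check of this sort can be sketched in a few lines with the OWL-API and HermiT. This is an illustration rather than our build code; the file path is hypothetical and the one-minute time budget is an arbitrary example.

```java
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import java.io.File;

public class ClassificationCheck {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology efo = manager.loadOntologyFromOntologyDocument(new File("efo.owl")); // hypothetical path

        OWLReasoner reasoner = new ReasonerFactory().createReasoner(efo);
        if (!reasoner.isConsistent()) {
            throw new IllegalStateException("Ontology is inconsistent");
        }

        long start = System.currentTimeMillis();
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY); // classify the ontology
        long elapsed = System.currentTimeMillis() - start;

        if (elapsed > 60_000) { // arbitrary one-minute budget, purely for illustration
            throw new IllegalStateException("Classification took too long: " + elapsed + " ms");
        }
        System.out.println("Classified in " + elapsed + " ms");
    }
}
```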


We also employ an automated release cycle to release a new version of EFO monthly, in order to best coordinate with our application needs. The release is programmatically performed using a Bamboo build plan which performs tasks such as creating the inferred version of the ontology, converting the ontology to OBO format, publishing files to the web, building the EFO website and creating URLs for classes in the EFO namespace to ensure that concepts described in EFO fully dereference.
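Two of those release tasks - materialising the inferred hierarchy and producing an OBO rendering - can be sketched with the OWL-API as below. This is an illustration only (the real release is a Bamboo build plan, not a single class); file names and the inferred-ontology IRI are hypothetical, and the OBODocumentFormat class name assumes a recent OWL-API version.

```java
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.formats.OBODocumentFormat;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import java.io.File;

public class ReleaseTasks {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology efo = manager.loadOntologyFromOntologyDocument(new File("efo.owl")); // hypothetical path
        OWLDataFactory factory = manager.getOWLDataFactory();

        // 1. Materialise the inferred direct superclasses into a separate 'inferred' ontology
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(efo);
        OWLOntology inferred = manager.createOntology(
                IRI.create("http://www.ebi.ac.uk/efo/efo-inferred.owl")); // illustrative IRI
        for (OWLClass cls : efo.getClassesInSignature()) {
            for (OWLClass sup : reasoner.getSuperClasses(cls, true).getFlattened()) {
                manager.addAxiom(inferred, factory.getOWLSubClassOfAxiom(cls, sup));
            }
        }
        manager.saveOntology(inferred, IRI.create(new File("efo-inferred.owl").toURI()));

        // 2. Save an OBO-format rendering of the asserted ontology
        manager.saveOntology(efo, new OBODocumentFormat(), IRI.create(new File("efo.obo").toURI()));
    }
}
```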

Agility, reality, profanity


Our overall approach has improved production quality immensely over the last few years. To quantify this with an example: over our last three months of work, our EFO continuous integration testing has passed on check-in 74% of the time. This means that 26% of the time it has not. Although this sounds like a bad thing, it's actually good to know we're catching these problems now, before a release goes out to applications. Much of this is relatively minor stuff, like whitespace in labels, which we are fairly strict on, but sometimes it's more serious stuff that we're glad we caught.

We've also become more dynamic in prioritising and sharing tickets, meaning the more important stuff gets done more quickly and by a number of people, with tickets picked off the top of the priority pile as people become available.

We still struggle with a few things, and these are challenges that hit most ontology consumers, I think. The biggest is balancing correctness with 'doing something'. This is a tricky brew to get right, as we don't want the ontology to be wrong, but we do want to get things out and working as quickly as possible. Thinking about the metaphysical meaning of a term over a period of months does not help when you have data to annotate covering 1,000 species and 250,000 unique annotations as your target; this is the reality we face. In the same breath though, getting things very wrong doesn't provide the sort of benefits you want from using an ontology - and using an ontology adds an overhead, so there should be benefits.

There is a dirty word in the ontology world that most dare not utter, but we do so here: 'compromise'. We do stuff; if it's wrong we fix it; we release early, release often and respond rapidly to required changes from users. Sound familiar?

Monday, 4 February 2013

When will bio-ontologies grow up (and how will we know)?

Robert Stevens and I have published an experiment we did on evaluating levels of activity in bio-ontologies over the last decade. It feels like a decade since we did the work, such was the delay by the journal in getting it out. Here's a summary of the full paper.

At ICBO 2011 in Buffalo, NY, Robert Stevens and I were chatting outside my hotel about which ontologies we use in our work and how one makes a choice. A few others there - I recall +Melanie Courtot and +Frank Gibson were also present - had thoughts too. There was a collective wisdom about ontology maturity, development and engineering in what we said, and we felt it would probably change the landscape of this area of research forever. I wish I had been able to remember any of it.

Nevertheless, Robert and I went ahead and performed a bit of work looking at one aspect of ontology evaluation, to see if we could glean some insights into the constitution of the various bio-ontologies in existence and how far we had come. We limited our work to looking at what we called 'activity'. Some of the research questions we wanted to investigate were:

  • How frequently is an ontology updated?
  • What do these changes look like?
  • Who makes these changes?
  • Is there attribution for changes?
  • Can we see patterns (profiles) as ontologies mature?

"Ontology activity" by Aureja Jupp. 
Our method for doing this was relatively simple. Firstly, find the available ontology repositories. Secondly, download the ontologies and record data about each - date, committer, etc. Thirdly, perform a syntactic diff between subsequent versions, looking at the number of classes added, deleted, and with axiomatic changes (for example, a new parent class or part_of assertion made on them). Finally, perform a bit of analysis on these results.

Activity and the Super-Ontologist

We performed the diff using a tool I wrote several years ago now, called Bubastis - there's also a Java library available since we did this work. The tool is fairly simple; it uses the OWL-API to read in OWL or OBO ontologies and performs a class-level set difference on axiom assertions. It also looks for newly declared named classes and, similarly, for named classes that were present in a previous version but have since been removed.

I'm not going to go into everything here - you can read the paper for all the details - but here are a few of the interesting things we found.

1. Most activity in the community is in refining classes that already exist within an ontology. Alongside this, we also found that a lot of classes were deleted between versions, which is in contrast to the received wisdom that classes are only made obsolete and never removed. Arguably, for the OBO ontologies this is less of a crime; when we look at the details we can see some of these deletions are caused by namespace changes between versions, with the ID fragment at the end (e.g. the GO_1234567 bit) not changing. Nevertheless, this is a problem if one chooses to use the OWL versions or the full URIs for these ontologies.

2. Between 2009 and 2011 ontology activity remained fairly constant. We produced a metric by totalling all the changes and compared the two periods using a paired t-test, which suggests that activity levels have not changed significantly.

3. Active ontologies tend to release often. This is perhaps not surprising to anyone familiar with the software practice of releasing early and often, but was good to see. The received wisdom in software engineering is that this allows for rapid feedback from, and response to, users - something active ontologies have adopted.

4. Some ontologies may be dead. Dormancy may suggest an ontology is inactive, or that it is complete. The Xenopus anatomy ontology is currently inactive, and this is more likely to be a case of completeness than death. Nevertheless, monitoring when an ontology becomes moribund is almost certainly a worthwhile endeavour for efforts such as the OBO Foundry, since an ontology could end up occupying an area and thereby prevent progress.

5. Lots of committers does not always lead to lots of activity. There are several factors to consider here. Firstly, a few of the projects use automated bots to release the ontology, so the committer name is not a good indicator of the number of editors. Nevertheless, there are some projects with many different committers that have low levels of activity, and vice versa. This may be suggestive of many things - that large collaborative efforts suffer from "consensus paralysis" - but it may also suggest that tracking tacit contributions is difficult. Which raises other issues, namely appropriate credit for such contributions. We arrived at more questions than answers with this one.

6. There lives amongst us a Super-Ontologist. The committer 'girlwithglasses' has made a total of 500 ontology revisions spanning 13 different ontology projects; she is truly the Ontological Eve.

Discussions

We had many discussions when writing this work up as to what it all meant, and we've put most of this into the paper - if you're interested you'd be best off heading there. Perhaps, more than anything, the conclusion I was drawn to from all of this work was that ontology engineering is still immature. We leaned towards software engineering when we undertook this work - maturity profiles, diff analysis on revisions, etc. - but we were feeling our way towards what might be a good way of summarising this aspect of what you might call ontology quality. Software and ontologies are alike in many respects and different in many others, but I still think this is where our best hope lies for applying QC.

I've previously discussed why the lack of ontology engineering practices is a problem, and I think we need more quantifiable approaches for developing and evaluating ontologies. In a workshop I attended in September, one of the top desires of the biologist user community was advice on which ontologies they should use. I'm always tempted to name the ontologies I know and use. When we first started using ontologies, we performed an analysis of coverage across the various ontologies to work out which would offer us most. Coverage is one metric - a simple but important one - but now that there are so many ontologies offering coverage, we need more than this to inform our decision. Our ontologies are growing up. We should probably help point them in the right direction.

Sunday, 9 December 2012

After the hype, the biology (and finally, the cancan)

I spent an enjoyable three days in Paris at this year’s SWAT4LS, meeting and talking with a lot of interesting people working in and around the semantic web community. Here are a few observations I made.

There were a lot of people.
It was my first time at this particular meeting and it struck me that there were a lot of people in attendance - somewhere around 100. I was surprised, though perhaps I should not have been, as the community has been steadily growing over the last few years. Still, this suggests that this is no longer just a niche activity and has many more paid-up members than ever before. OK, I accept the fact that it was in Paris might have swayed a few hearts and minds, but still.
They were cancaning in the aisles at Moulin Rouge
when they heard about all the RDF in town. 

There was a lot of interesting work…
I was pleasantly surprised at how good a lot of the work presented was, and it spoke to much of what we have been doing over the past six months. To pick out one, I was impressed with a talk given by Matthew Hindle on data federation in ecotoxicology using SADI, in which he outlined the way they utilise the services to start to answer what I would call proper biological questions. It was a nice example of where they had been able to pick up some of the existing approaches and resources and apply them to their data and problems successfully, and it speaks to the notion that the field has matured somewhat over the last few years. Of course I've read this claim numerous times in the past, but largely from an anecdotal point of view; this was at least evidence that, to some extent, things exist that can be used to solve problems in the life sciences. There was inevitably some work to do of course - they did not find everything as an out-of-the-box solution - but the components were there at least.

I also liked the SPARQL R package that has been recently published and that we've been using in an MSc project with one of our students. It's been very useful, and we've written a package for analysing our Gene Expression RDF data in more intuitive ways, allowing simple R-like function calls to be made over the RDF, behind which the SPARQL lies, hidden. I think this sort of tool is important because it exposes the technology to an important, existing bioinformatics community in ways that aid and do not hinder their work. We'll be releasing this tool early next year.

UniProt also presented their work on using rules within their RDF to detect inconsistencies within the annotations. I like this a lot, and it's something we have also started exploring with our own RDF, such as looking for disease annotations that were made to ontology classes which were not subclasses of disease - we found a couple. This demonstrates nicely the advantage of having your data in a format that is native to the ontology. It enables one to ask meaningful ontology-type questions (subclasses of x, classes part_of y, etc.) rather than having to formulate some hack between a database query and an OWL one and then do a textual comparison.
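To give a flavour of that kind of consistency rule, here is a minimal sketch of such a check as a SPARQL query run from Java with Apache Jena. It is not our production check: the data file, the annotation property and the Disease class IRI are all hypothetical, and it assumes the ontology's subclass hierarchy is loaded into the same model as the data.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class DiseaseAnnotationCheck {
    public static void main(String[] args) {
        // Hypothetical file containing both the data and the ontology class hierarchy
        Model model = RDFDataMgr.loadModel("atlas-sample.rdf");

        // Find things annotated as diseases whose class is not actually a subclass of Disease
        String query =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "PREFIX ex:   <http://example.org/terms/> " +                 // hypothetical namespace
                "SELECT ?annotation ?cls WHERE { " +
                "  ?annotation ex:hasDiseaseAnnotation ?cls . " +              // hypothetical property
                "  FILTER NOT EXISTS { ?cls rdfs:subClassOf* ex:Disease } " +  // hypothetical disease class
                "}";

        try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(query), model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next()); // each row is a suspect annotation
            }
        }
    }
}
```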

…but nothing that got me really, really excited
I didn't at any point have that epiphany that made me think "we're there". I've read hyperbole of this sort far too often in publications (numbers of triples alone do not equate to good science), and at the moment it still remains just that. I do think it's maturing and I think this field is now becoming important to those working in the life sciences. Certainly here at EBI we've been working a lot on RDF representations of data over the last six months, and I see this from others too. But it doesn't underpin the basic science we do. It's not that important to our databases, our curation, our applications and, most importantly, our users. It may become so - I think it probably will - but it's not there yet, so, for now, read those aforementioned publications with skepticism.

There is much more to be done; opportunity, hard work and money
I see a lot of interesting opportunities in this area but, to my mind, more engineering methods need to be applied if we want to see fewer bespoke solutions that live and die within the time it takes for a paper to be written, accepted and published (in particular SPARQL endpoints). I've said this many times about engineering ontologies, and I think it applies equally here. We need better update models - updating RDF once it's in a triple store seems to be too onerous a task at present. And documentation on how to do this stuff is really lacking. The HCLS W3C group that we have been working with has been having a go at writing a note on representing gene expression data in RDF, but it's slow work and I still don't know if we're on the right lines. Which also suggests we need better ways of evaluating this stuff. What does it all mean once it's out there, in RDF? Can I use it, integrate with it, and how? Most importantly, is it correct? It's not just about sticking a URI on everything and making it RDF - that's too simple if we want to use this in more computational approaches to analysing and solving real biological problems. One of the big problems I've found in this area is convincing funding agencies that these approaches can result in discoveries in biology when the evidence for this is conceptually strong but practically weak. As is often the case, we have a scenario in which waiting for discoveries before giving more funding will mean the discoveries never come, because the work is never funded. If I had a single want, it would be some focused calls for this stuff across the likes of the BBSRC and EU Framework 7, with some specific biological objectives. There is hope on this front - Douglas Kell gave a plenary talk at last year's SWAT4LS, so there is some recognition of the importance of this area.

My comrade-in-arms Simon Jupp recently gave a talk on the RDF work we have been doing (mainly Simon), to which I added a subtitle: 'after the hype, the biology'. The hype may be fading and the biology may be surfacing, but there is still much to do.

*Congratulations to Tomasz Adamusiak on winning an iPad in the Nanopublication competition. He celebrated Tomasz-style: an evening at the Moulin Rouge.*

Tuesday, 30 October 2012

Future Ontology - Five year predictions past and future

I made a number of ontology predictions five years ago to this very day. Here's a review of those and a few more for the next five years.

On 30th October 2007 I gave my first public presentation on the ontology work I'd been doing since starting at the EBI in May of that same year. Today, 30th October 2012, I gave a talk during which I reflected back on things that have passed over those five years and, in order to do so, ended up looking at that old set of slides from 2007. It made for interesting reading. At the end of that 2007 talk I had made some predictions about what we might do and where we might end up as a community. I thought it might be nice to share those now and to make a few more public five-year predictions. If I'm still working and the world has not ended in 2017, maybe I'll do it all again.
I also predicted that leather pants would make
a comeback. I was wrong on that one, thank God.

October 2007 - My Future Ontology Predictions


We will rely on and reuse external URIs in our work, rather than minting our own, as ontologies become more populous and stabilise.

This is certainly true of the work we do at EBI. Some ontologies, such as the Gene Ontology, have been stable for quite a while, and a couple of others have also followed a fairly stable road to persisting ontology URIs over time (including our own EFO, which follows the GO's practice). What I really wanted to see was that once a URI is minted in an ontology it persists unless there is a very good reason for it to go away. Many more bio-ontologies do this than used to and, in some ways, this is a measure of the maturity of the community.

We will add dereferenceable URIs for our data and put metadata behind them.

Partially. This happens for most ontologies now, which is a definite step in the right direction. For example, OntoBee does a lot of content negotiation for OBO ontologies, which is nice. The data side still lags behind, but we are looking into that internally now. I suppose overall the data part is less true than I would have wished, but this is a classic chicken and egg: the promise comes later, when everyone does it, but until everyone does it, it is not immediately obvious why you should. We need to be bold.

As ontology numbers increase and overlap, mapping between them, and between the data described with them, will become our biggest challenge.

I think this is true and I am still concerned by it. Where it is not true is that there is not a huge amount of data published using all of these ontologies, as I perhaps envisaged in 2007. But I still maintain that when more is published, the mapping problem will be difficult. Having said all of that, I am also unconvinced that building one ontology for each domain, by attempting to get all communities to agree to every definition of every biological concept, is the answer either. I've seen ontologies grind to a halt through analysis paralysis over the last five years, and this is also not the way to go. Swapping one critical problem for another is not a good solution.

Agent technology will help in our mappings and in the way we discover data.
This pretty much didn't happen - in bioinformatics anyway. But I think this was because I was overly optimistic about how much of the infrastructure would exist in this semantic web world. It is worth noting, though, that Google's agents (web crawlers) do use rich snippet tags, which include RDFa descriptions of products on web pages, to help populate the 'shopping' search you see. So I was a bit off the mark, but not completely.

That biological triples would be championed by all.
I think this has definitely been wrong up until very recently, and in some ways I am guilty of being sucked in by the hype - by the promise that integrating all of our RDF data would bring. This year, though, the EBI has started an RDF Frontier group to trial this work. You can see the work Simon and I have been doing at our FGPT Atlas RDF page to check how this is progressing - well so far, I'd say. I was a bit premature on this prediction. Which leads me on nicely to...

October 2012 - My future ontology predictions for the next five years


The number of reference ontologies will level off (and some will disappear) and natural 'winners' will emerge.
I make a deliberate distinction here by saying 'reference' ontologies, as I think the number of ontologies put together for applications will likely increase, but they are unlikely to be considered as references for a domain. I think funding for building these reference ontologies will fall and, sadly, some may even become moribund. But a lot of the important ones will live on and continue to develop. The natural winners - ontologies that become the de facto choice for a domain - will emerge. We will need to find ways of using these ontologies that do not necessitate building a whole new reference ontology.

Upper ontologies will play a less important role in the community.
Some might say 'about time', but I do think they've had a role to play and have helped with some things, even if the approach of those involved has been, shall we say, less than endearing. But I think their domination of every discussion about whether an ontology is 'good' and how an ontology fits into an upper ontology will decline, in favour of focusing on how we can use the ontologies to describe our data and do biology.

Use of ontologies and semantic web technologies in Bioinformatics will become ubiquitous.
This is ambitious, but I think it should and will happen. I'm convinced, even from some of the early prototyping work we've been doing, that there is enough data out there now to warrant applications for biologists to use.

Publishers will curate literature using ontologies and make the API to these annotations public.
I think the great work that already happens in GoPubMed should happen for more ontologies and for more publishers. It's just great. For more on this I refer you to Phil Lord, who wages a one-man war for more semantics in publications (amongst other improvements in the industry).

Google will endorse the semantic web.
Or they will at least admit that it's useful. They already use semantics with rich snippets. I'd like to see them support this area of web science and I think they eventually will. If they outwardly endorsed this then, who knows, I certainly think more people would use semantics when publishing their data on the web.

I'd love to hear more from those in the community who are willing to stick a stake in the ground.

Thursday, 13 September 2012

Bringing Genome Wide Associations to Life

A year into the EBI's joint project with the NHGRI, the GWAS catalog and its data visualization have become interactive with help from Semantic Web technologies.

In the fall of 2010, we started an informal collaboration with the NHGRI's Catalog of Genome Wide Association Studies (Hindorff et al, 2009) with our student, Paulo Silva. In that project, he rapidly created an ontology based on our ongoing EFO efforts to describe a lot of the traits in GWAS (see the NHGRI site for more on how the catalog is put together). Paulo deployed this using the existing ArrayExpress website infrastructure as a backbone to illustrate the benefits of using an ontology for curation and searching. Later that year, Paulo and Helen Parkinson presented this work to the NHGRI.
Figure 1. GWAS Catalog as of 2011. Much skilled work put this beautiful artefact
together. "My God, it's full of stars!"

Thankfully, they liked it, and in October 2011 the EBI started a formal collaborative project with the NHGRI to improve the process of curation, based on our expertise in using ontologies for annotation and in deploying them in applications. In addition, the famous and much-cited GWAS catalog diagram (see figure 1) was also to be given an overhaul in the way it was generated. The diagram illustrates GWAS traits associated with a SNP on a band of a human chromosome, and is generated manually by a skilled graphical artist every three months. If you take a look at the diagram you can quickly see that this is no easy task. New traits can appear (almost) anywhere each month, and this can mean a lot of shuffling around of coloured dots to accommodate them. In addition, each trait has a unique colour which is described by a complex key (see figure 2) - complex because there are so many traits and therefore a lot of colours required. In fact, the harder the team work to curate this data, the harder it becomes to generate the diagram. The lazy person in me thought the answer was simple - stop working so hard!
Figure 2. The key keeps growing as more work is included. Each of
these colours is unique. My favourite is #99FFFF

But they continue to work hard regardless, and so a further issue presents itself: searching the catalog, either via the search interface or by viewing the GWAS image, is also hampered by the growing size of, and lack of structure to, the trait names. With a small list this is not really a problem, but as the catalog has grown thanks to the curation efforts of the NHGRI, so has the number of traits.

Motivation complete, let's look at the progress so far. You can click here to see the new diagram generated by the team at EBI* (please see below for the list of contributors). The first thing to note is that it is now an interactive, dynamic diagram. You can zoom in and out, and you can mouse over traits to find out what they represent rather than having to look at the key. The default key has also been reduced down to higher-level groupings, significantly reducing the number of colours used. These groupings correspond to ontology classes in EFO which are superclasses of other traits and are generated based on maximum coverage (i.e. the 18 classes that covered the greatest number of traits). One of the advantages of using an ontology emerges here: you can begin to aggregate data together and get a better feel for global trends.

In addition, you can also drill down to more specific traits using the filter option. Because the diagram is dynamic, you can highlight only those traits of interest to you. Try entering 'cancer' into the 'Filter' box and pressing enter. The ontology is being used to show those SNPs annotated to cancer or to subclasses of cancer. Now try 'breast cancer' and you can see a subset of these cancer trait dots remain highlighted. What about 'breast carcinoma'? Have a go. Clear the filters (click 'clear filter'), enter 'breast carcinoma' and you'll see the same results as for 'breast cancer'. Again, this is the ontology at work; the browser is using synonyms stored in the ontology to perform the same query it does for 'breast cancer'. Simple, but very useful.

The benefit of generating the diagram programmatically is perhaps most evident in the time series view (click on the tab). A selection of diagrams from over the last seven years is shown here, and all were generated with just a few clicks. This is in stark contrast to the manual, hand-crafted artefacts that went before, which took many weeks and much skill to produce. Seems almost unfair really.

The Techy Bit


So what's going on behind the scenes? The list of over 650 highly diverse traits is mapped to EFO. These traits include phenotypes (e.g. hair colour), treatment responses (e.g. response to antineoplastic agents), diseases and more. Compound traits such as 'Type 2 diabetes and gout' are separated. Links between relevant traits are also made to facilitate querying (e.g. partonomy). It's also worth mentioning that EFO reuses (i.e. imports from) a lot of existing ontologies, such as the Gene Ontology and the Human Phenotype Ontology, which also facilitates future integration.

The GWAS ontology is a small ontology used within the triple store to describe the structure of a GWAS experiment (not including traits). It describes: the link between a trait and a SNP and the p-value of that association; where that SNP appears on a chromosomal band; and on which chromosome. Each chromosome is an SVG in which the XML source has been tagged with ontology information, e.g. 'these coordinates of the SVG are band x'. The dots representing traits are similarly assigned the trait type (from the ontology) within the class attribute of the SVG. These bits of information are used to render each trait on the images, and allow for filtering on trait - and (in the very near future) on other properties, such as the association p-value.

DL queries are also employed as part of this visualisation, making use of the OWL describing the two ontologies (EFO and the GWAS ontology). The initial creation of the diagram uses a DL query which asks, amongst other things, which bands are on which chromosomes, which SNPs are on each band and which traits should be rendered where. Traits are selected using another DL query which, as well as location information, asks whether they satisfy a certain p-value threshold (the p-value being a datatype property) - those that do not satisfy the threshold are not rendered. In the future this will become more dynamic, so a user can specify their own p-value threshold.
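As an illustration of what such a DL query looks like programmatically, here is a minimal sketch using the OWL-API and HermiT: ask the reasoner for classes that have an association to (a subclass of) a trait and whose p-value falls below a threshold. It is not the production code; the object and datatype property IRIs are hypothetical, the file name is made up, and the EFO IRI for cancer is given for illustration only.

```java
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import java.io.File;

public class TraitQuerySketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("gwas.owl")); // hypothetical file
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);

        OWLClass cancer = factory.getOWLClass(
                IRI.create("http://www.ebi.ac.uk/efo/EFO_0000311"));            // illustrative trait class
        OWLObjectProperty hasAssociation = factory.getOWLObjectProperty(
                IRI.create("http://example.org/gwas/has_association"));         // hypothetical property
        OWLDataProperty pValue = factory.getOWLDataProperty(
                IRI.create("http://example.org/gwas/p_value"));                 // hypothetical property

        // "Things associated with (a subclass of) cancer, with a p-value of at most 5e-8"
        OWLClassExpression query = factory.getOWLObjectIntersectionOf(
                factory.getOWLObjectSomeValuesFrom(hasAssociation, cancer),
                factory.getOWLDataSomeValuesFrom(pValue,
                        factory.getOWLDatatypeMaxInclusiveRestriction(5e-8)));

        // Ask the reasoner for the named classes satisfying the expression
        for (OWLClass c : reasoner.getSubClasses(query, false).getFlattened()) {
            System.out.println(c.getIRI());
        }
    }
}
```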

These queries all take a while, so they are pre-computed when the diagram is built each release, and each SVG is then cached on disk for quick retrieval. The actual filtering queries (which will also drive auto-complete in the near future) also use a simple DL query: they ask for all the subclasses of a given trait class from EFO, and these traits are passed back in a JSON object which is used to refresh the diagram, showing the SVGs for that class. In the longer term, the aim is to use disk caching of the reasoner using something like Ehcache. Currently this is not possible due to some serialization issues with the version of the OWL-API being used, but this is set to change. It will enable much more complex queries to be performed dynamically, utilising EFO, such as traits that may be the subject of a protocol of measurement, or diseases that affect a particular system. There are many possibilities.

A side-effect of this project is that the technology is reusable for other projects, and indeed we intend to utilise some of it in our current work on rendering the Gene Expression Atlas in RDF (more on that very soon). The model is relatively simple: describe your data in some schema (an OWL ontology), work out what each ontology class looks like in SVG (the relationship between an ontology class and the image) and then render it.

This is all done using Semantic Web technologies: OWL, RDF, triple stores, ontologies, description logic reasoners. And you'd never know it looking at the website. I've always thought this is the perfect example of when the Semantic Web comes good - it just works and the user never knows about the guts. This is the right way round, I think; the biology should come first in this application, and the technology is secondary. 99% of users won't really care that your website uses CSS or JavaScript, and the same should be true here. When Semantic Web technologies hit this level of brilliant but quiet, accepted ubiquity, we will have reached our goal.

*List of contributors
The tooling for this project was developed by Danielle Welter and Tony Burdett, with input from Helen Parkinson, Jackie MacArthur and Joannella Morales at the EBI and the catalogue curation team at the NHGRI.