Zooma - Like an Ooma but even faster.
This is a much valued service by a lot of the community, but it is not without issues. One of the primary issues is that the volume of data submitted to ArrayExpress continues to increase. Even though it has been postulated that microarrays are dead, there is no sign that submission of these experiments is slowing; our figures show they are, on the whole, stable. On top of this, new sequencing technologies are emerging almost monthly, it would seem, and our figures also show a slow but steady increase in this sort of submission. Overall, this leads to a net increase in the number of submissions coming into ArrayExpress.
So business is good, but this is not without its drawbacks. The primary one is cost, and I mean this in the broadest sense: high-quality annotations are time-consuming and there is a limit to how many experiments a curator can curate. Simply put, more experiments mean more resources are required. We call this an annotation gap, i.e. the gap between the high-quality data annotation we want (especially using ontology classes) and the amount of resources available to do such annotation.
Zooma is an RDF knowledge base of annotation knowledge, extracted from the expert curation performed on a subset of ArrayExpress data. This subset of data has the added advantage of being curated twice because it has also been loaded into the Gene Expression Atlas, where it has been aligned to ontology classes in EFO. This is very powerful for several reasons.
Firstly, it enables access to the curation process. This is useful because it allows a person to easily look up how a property (some textual item) has been annotated to an ontology class and therefore repeat the process - the users here are both external submitters and our own curation team (see image). This makes curation consistent and rich. An additional benefit is that it also enables computational exploitation of the curation process. Not only does Zooma capture how a textual property has been mapped to an ontology class, but it also captures corrections between annotations, for example an update to a more accurate class. What this really gives us is a big set of rules, manually created over several years, which can be applied to data automatically.
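To make the "big set of rules" idea a little more concrete, here is a minimal sketch in Python. It is not Zooma's actual implementation or API: the rule table, the correction table, the function names and the `example.org` class URIs are all hypothetical placeholders, and real rules would come from the knowledge base rather than a hard-coded dictionary.

```python
from typing import Optional

# Hypothetical curated rules: (property type, property value) -> ontology class URI.
# The URIs are placeholders, not real EFO identifiers.
RULES = {
    ("organism part", "liver"): "http://example.org/onto/LIVER",
    ("disease state", "breast cancer"): "http://example.org/onto/BREAST_CARCINOMA_OLD",
}

# Hypothetical corrections: an older class that curators later replaced
# with a more accurate one.
CORRECTIONS = {
    "http://example.org/onto/BREAST_CARCINOMA_OLD": "http://example.org/onto/BREAST_CARCINOMA",
}


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def annotate(prop_type: str, prop_value: str) -> Optional[str]:
    """Return the ontology class URI for a textual property, or None if no rule matches."""
    uri = RULES.get((normalise(prop_type), normalise(prop_value)))
    seen = set()
    # Follow the chain of corrections, guarding against accidental cycles.
    while uri in CORRECTIONS and uri not in seen:
        seen.add(uri)
        uri = CORRECTIONS[uri]
    return uri
```

Calling `annotate("Disease State", "Breast  cancer")` would first match the old rule and then follow the correction to the newer class, which is the kind of repeatable behaviour the captured curation history makes possible.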
A second feature is that provenance is stored and used for ranking and filtering. Using the Open Annotation Model, Zooma records where an annotation 'rule' has come from, for instance whether it was asserted by a curator or inferred from the knowledge base.
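As a rough illustration of how provenance could drive ranking and filtering, consider the sketch below. The field names, the two evidence labels and the ordering (curator-asserted ahead of inferred) are assumptions made for this example; they are not taken from the Open Annotation Model or from Zooma itself.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SourcedAnnotation:
    """Hypothetical annotation record carrying its provenance."""
    property_value: str
    class_uri: str
    evidence: str  # assumed labels: "curator_asserted" or "inferred"
    source: str    # where the annotation came from, e.g. a database name


# Assumed preference order: lower number means more trusted.
EVIDENCE_RANK = {"curator_asserted": 0, "inferred": 1}


def rank(annotations: Iterable[SourcedAnnotation],
         allowed_evidence: Optional[Set[str]] = None) -> List[SourcedAnnotation]:
    """Filter by evidence type (if given), then sort most trusted first."""
    kept = [a for a in annotations
            if allowed_evidence is None or a.evidence in allowed_evidence]
    # Unknown evidence labels sort after all known ones.
    return sorted(kept, key=lambda a: EVIDENCE_RANK.get(a.evidence, len(EVIDENCE_RANK)))
```

Because Python's sort is stable, annotations with the same evidence type keep their original relative order, so any earlier ordering (say, by date) is preserved within each group.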
A third feature is that everything has a URI: every annotation, sample, assay and study, and these link to experiments in ArrayExpress and the BioSample Database. So this is truly linked (to) data.
Finally, this model is additive. Not only can our curators and submitters add their own new annotations, but any annotation based on our simple abstract model can be incorporated - including a whole database dump.
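The additive idea can be sketched as a set that only ever grows. The four-field record below is a guess at what a "simple abstract model" might minimally contain (property type, property value, ontology class, source); it is an assumption for illustration, not the actual Zooma model.

```python
from typing import Iterable, NamedTuple, Set


class Annotation(NamedTuple):
    """Hypothetical minimal annotation record."""
    property_type: str
    property_value: str
    class_uri: str
    source: str


def add_annotations(kb: Set[Annotation], incoming: Iterable[Annotation]) -> int:
    """Add annotations to the knowledge base and return how many were new.

    Existing annotations are never removed or overwritten; exact duplicates
    are simply ignored.
    """
    before = len(kb)
    kb.update(incoming)
    return len(kb) - before
```

The same function would serve for a single curator edit or for a whole database dump, since both are just iterables of records in the same shape.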
Have a play with the live demo at http://wwwdev.ebi.ac.uk/fgpt/zooma
Tony Burdett will be presenting this work at ISMB Technology Track on Sunday July 15th at 15:30.
This work is supported in part by EMBL and by the DBP project with NCBO via NIH.