Friday, 13 July 2012

A million gene expression annotations with Zooma

ArrayExpress, based at the EBI, is one of the world's largest public repositories of transcriptomic data. One of the most valued features of the repository is that submitted data undergoes curation, not only by computational assessment but ultimately by human experts - our curators. Their job is to ensure this data meets certain minimum quality requirements and is described in a way that is accurate (ontologies help with this) and therefore searchable in the archive.

Zooma - Like an Ooma but even faster.

This service is valued by much of the community, but it is not without issues. One of the primary issues is that the volume of data submitted to ArrayExpress continues to increase. Even though it has been postulated that microarrays are dead, there is no sign that submission of these experiments is slowing; our figures show they are, on the whole, stable. On top of this, new sequencing technologies seem to emerge almost monthly, and our figures also show a slow but steady increase in this sort of submission. Overall, this leads to a net increase in the number of submissions coming into ArrayExpress.

So business is good, but this has its drawbacks. The primary one is cost, and I mean this in the broadest sense: high-quality annotations are time-consuming and there is a limit to how many experiments a curator can curate. Simply put, more experiments mean more resources are required. We call this an annotation gap, i.e. the gap between high-quality data annotation (especially using ontology classes) and the amount of resources available to do such annotation.

Zooma 2 user interface screenshot: searching the Zooma KB for an annotation. The pop-out here shows information for the first hit, "Caucasian" used as a value for the category "ethnicity". This pattern has been used 608 times, and each annotation has its own URL.

One of the ways of reducing this annotation gap is by enabling submitters to annotate their own data more easily and in closer alignment with a common standard, in this case the ontologies we use. This reduces the effort required by curators to make sure everything is aligned within the repository. Another way is to maximise the amount of automated annotation against such ontologies that can be done. This is the job of Zooma.

Zooma is an RDF knowledge base of annotation knowledge, extracted from the expert curation performed on a subset of ArrayExpress data. This subset of data has the added advantage of being curated twice because it has also been loaded into the Gene Expression Atlas, where it has been aligned to ontology classes in EFO. This is very powerful for several reasons.

Firstly, it enables access to the curation process. This is useful because it allows a person to easily look up how a property (some textual item) has been annotated to an ontology class and therefore repeat the process - the users here are both external submitters and our own curation team (see image). This makes curation consistent and rich. An additional benefit is that it also enables computational exploitation of the curation process. Not only does Zooma capture how a textual property has been mapped to an ontology class, but it also captures corrections between annotations, for example an update to a more accurate class. What this really gives us is a big set of rules, manually created over several years, which can be applied to data automatically.
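To make the idea of "rules" a little more concrete, here is a minimal Python sketch of how curated mappings and corrections could be turned into a lookup that annotates new text automatically. This is not Zooma's actual implementation: the data, the placeholder class identifiers (`EX:0001` and so on) and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical curated annotations: (property type, property value, ontology class).
# The class identifiers are placeholders, not real EFO terms.
CURATED = [
    ("ethnicity", "Caucasian", "EX:0001"),
    ("ethnicity", "caucasian", "EX:0001"),
    ("organism part", "liver", "EX:0002"),
    ("organism part", "Liver", "EX:0003"),
]

# Hypothetical corrections: an older class replaced by a more accurate one.
CORRECTIONS = {"EX:0003": "EX:0002"}


def normalise(text):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def resolve(ontology_class, corrections):
    """Follow correction links until reaching a class with no further update."""
    seen = set()
    while ontology_class in corrections and ontology_class not in seen:
        seen.add(ontology_class)
        ontology_class = corrections[ontology_class]
    return ontology_class


def build_rules(curated, corrections):
    """Count how often each (type, value) pair was mapped to each class."""
    rules = defaultdict(Counter)
    for prop_type, prop_value, ontology_class in curated:
        key = (normalise(prop_type), normalise(prop_value))
        rules[key][resolve(ontology_class, corrections)] += 1
    return rules


def annotate(rules, prop_type, prop_value):
    """Return (class, usage count) for the most used mapping, or None if unseen."""
    counts = rules.get((normalise(prop_type), normalise(prop_value)))
    if not counts:
        return None
    return counts.most_common(1)[0]
```

Calling `annotate(build_rules(CURATED, CORRECTIONS), "organism part", "Liver")` returns the corrected class together with the number of times that pattern was used, which mirrors the usage count shown in the search interface. Properties that have never been curated return `None` and would still need a curator.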

A second feature is that provenance is stored and used for ranking and filtering. Using the Open Annotation Model, Zooma records where an annotation 'rule' has come from, for instance whether it was asserted by a curator or inferred from the knowledge base.
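As a rough sketch of how provenance might drive ranking and filtering, the Python below sorts candidate annotations by where they came from and then by how often they have been used. The evidence labels, the trust ordering and the field names are assumptions for illustration only; they are not taken from Zooma or from the Open Annotation Model vocabulary.

```python
from dataclasses import dataclass

# Assumed trust ordering: curator assertions outrank inferred annotations.
EVIDENCE_RANK = {"curator_asserted": 2, "inferred": 1}


@dataclass(frozen=True)
class Annotation:
    property_value: str
    ontology_class: str  # placeholder identifier, not a real EFO term
    evidence: str        # "curator_asserted" or "inferred" (hypothetical labels)
    usage_count: int     # how many times this pattern has been used


def rank(annotations, allowed_evidence=None):
    """Optionally filter by evidence type, then sort by evidence rank and usage."""
    if allowed_evidence is not None:
        annotations = [a for a in annotations if a.evidence in allowed_evidence]
    return sorted(
        annotations,
        key=lambda a: (EVIDENCE_RANK.get(a.evidence, 0), a.usage_count),
        reverse=True,
    )


# Hypothetical candidates for the same piece of text.
CANDIDATES = [
    Annotation("liver", "EX:0002", "inferred", 40),
    Annotation("liver", "EX:0004", "curator_asserted", 3),
    Annotation("liver", "EX:0002", "curator_asserted", 12),
]
```

With this ordering, `rank(CANDIDATES)` puts both curator-asserted annotations ahead of the inferred one even though the inferred pattern has been used more often. Whether provenance should always outweigh frequency is a design choice; the sketch simply shows that recording provenance makes the choice possible.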

A third feature is that everything has a URI: every annotation, sample, assay and study. These link to experiments in ArrayExpress and the BioSample Database, so this is truly linked (to) data.

Finally, this model is additive. Not only can our curators and submitters add new annotations, but any annotation based on our simple abstract model can be incorporated - including a whole database dump.

Have a play with the live demo at
Tony Burdett will be presenting this work at ISMB Technology Track on Sunday July 15th at 15:30.
This work is supported in part by EMBL and by the DBP project with NCBO via NIH.
