James Malone's EBI Blog: September 2012

A year into the EBI's joint project with the NHGRI, the GWAS catalog and its data visualization have become interactive with help from Semantic Web technologies.

In the fall of 2010, we and our student, Paulo Silva, started an informal collaboration with the NHGRI's Catalog of Genome Wide Association Studies (Hindorff et al., 2009). In that project, he rapidly created an ontology based on our ongoing EFO efforts to describe many of the traits in GWAS (see the NHGRI site for more on how the catalog is put together). Paulo deployed this using the existing ArrayExpress website infrastructure as a backbone to illustrate the benefits of using an ontology for curation and searching. Later that year, Paulo and Helen Parkinson presented this work to the NHGRI.

Figure 1. GWAS Catalog as of 2011. Much skilled work put this beautiful artefact
together. "My God, it's full of stars!"

Thankfully, they liked it and in October 2011 the EBI started a formal collaborative project with the NHGRI to improve the process of curation based on our expertise in using ontologies for annotation and in deploying them in applications. In addition, the famous and much-cited GWAS catalog diagram (see figure 1) was also to be given an overhaul in the way it was generated. The diagram illustrates GWAS traits associated with a SNP on a band on a human chromosome, and is generated manually by a skilled graphical artist every three months. If you take a look at the diagram you can quickly see that this is no easy task. New traits can appear (almost) anywhere each month and this can mean a lot of shuffling around of coloured dots to accommodate them. In addition, each trait has a unique colour which is described by a complex key (see figure 2) - complex because there are so many traits and therefore a lot of colours required. In fact, the harder the team work to curate this data, the harder it becomes to generate the diagram. The lazy person in me thought the answer was simple - stop working so hard!

GWAS Catalog key with many different colours for the many different traits

Figure 2. The key keeps growing as more work is included. Each of
these colours is unique. My favourite is #99FFFF

But they continue to work hard regardless, and so a further issue presents itself: searching the catalogue, either through the search interface or by viewing the GWAS image, is also hampered by the growing size of, and lack of structure in, the trait names. With a small list this is not really a problem, but as the catalog has grown thanks to the curation efforts of the NHGRI, so has the number of traits.

Motivation complete, let's look at the progress so far. You can click here to see the new diagram generated by the team at EBI* (please see below for the list of contributors). The first thing to note is that it is now an interactive, dynamic diagram. You can zoom in and out, and you can mouse over traits to find out what they represent rather than having to look at the key. The default key has also been reduced to higher level groupings, significantly reducing the number of colours used. These groupings correspond to ontology classes in the EFO which are superclasses of other traits and are generated based on maximum coverage (i.e. the 18 classes that cover the largest number of traits). One of the advantages of using an ontology emerges here: you can begin to aggregate data together and get a better feel for global trends.
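To make the "maximum coverage" idea a little more concrete, here is a minimal sketch that counts how many traits sit under each candidate superclass and keeps the top few. The toy hierarchy, the class names and the function `top_covering_classes` are all assumptions made up for illustration; the real groupings are computed against EFO and may well be chosen differently.

```python
from collections import Counter

# Hypothetical toy hierarchy: child class -> parent class (not real EFO content).
PARENT = {
    "breast carcinoma": "cancer",
    "prostate carcinoma": "cancer",
    "cancer": "disease",
    "type 2 diabetes": "metabolic disease",
    "metabolic disease": "disease",
    "hair colour": "phenotype",
}


def ancestors(cls, parent=PARENT):
    """Yield every superclass of cls by walking up the parent links."""
    while cls in parent:
        cls = parent[cls]
        yield cls


def top_covering_classes(traits, candidates, n):
    """Return the n candidate classes that cover the most traits."""
    candidate_set = set(candidates)
    coverage = Counter()
    for trait in traits:
        # A trait is covered by itself and by all of its superclasses.
        covered = set(ancestors(trait)) | {trait}
        for cls in covered & candidate_set:
            coverage[cls] += 1
    return [cls for cls, _ in coverage.most_common(n)]
```

With the toy data, asking for the single best grouping over the four leaf traits picks "cancer", since it covers two traits while the other candidates cover one each.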

In addition, you can also drill down to more specific traits using the filter option. Because the diagram is dynamic, you can highlight only those traits of interest to you. Try entering 'cancer' into the 'Filter' box and press enter. The ontology is being used to show those SNPs annotated to cancer or subclasses of cancer. Now try 'breast cancer' and you can see a subset of these cancer trait dots remain highlighted. What about 'breast carcinoma'? Have a go. Clear the filters (click clear filter) and then enter 'breast carcinoma' and you'll see the same results as for breast cancer. Again, this is ontology at work; the browser is using synonyms stored in the ontology to perform the same query it does for 'breast cancer'. Simple, but very useful.
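The filter behaviour described above can be sketched roughly as follows: resolve a synonym to a class label, then collect that class and everything beneath it. The dictionaries here are hypothetical toy data, including the guess that 'breast carcinoma' is the stored label and 'breast cancer' its synonym; the real browser asks a reasoner over EFO rather than walking a Python dict.

```python
# Hypothetical toy data: synonyms and subclass links (not taken from EFO).
SYNONYMS = {"breast cancer": "breast carcinoma"}
SUBCLASSES = {
    "cancer": ["breast carcinoma", "prostate carcinoma"],
    "breast carcinoma": [],
    "prostate carcinoma": [],
}


def expand_filter(term):
    """Map a search term to the set of trait classes that should stay highlighted."""
    # Step 1: a synonym resolves to the label the ontology actually stores.
    label = SYNONYMS.get(term.lower(), term.lower())
    if label not in SUBCLASSES:
        return set()
    # Step 2: collect the class itself plus all of its subclasses.
    matched, todo = set(), [label]
    while todo:
        current = todo.pop()
        if current not in matched:
            matched.add(current)
            todo.extend(SUBCLASSES.get(current, []))
    return matched
```

Here `expand_filter("breast cancer")` and `expand_filter("breast carcinoma")` return the same set, which is the synonym behaviour you see in the browser, and `expand_filter("cancer")` returns the wider set that includes both carcinomas.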

The benefit of generating the diagram programmatically is perhaps most evident in the time series view (click on the tab). A selection of diagrams from over the last seven years is shown here and all were generated with just a few clicks. This is in stark contrast to the manual, hand-crafted artefacts that went before, which took many weeks and much skill to produce. Seems almost unfair really.

The Techy Bit

So what's going on behind the scenes? The list of over 650 highly diverse traits is mapped to EFO. These traits include phenotypes (e.g. hair colour), treatment responses (e.g. response to antineoplastic agents), diseases and more. Compound traits such as 'Type 2 diabetes and gout' are separated. Links between relevant traits are also made to facilitate querying (e.g. partonomy). It's also worth mentioning that EFO reuses (i.e. imports from) a lot of existing ontologies such as the Gene Ontology and Human Phenotype Ontology, which also facilitates future integration.

The GWAS ontology is a small ontology used within the triple store to describe the structure of a GWAS experiment (not including traits). It describes the links between a trait and a SNP and the p-value of that association, where that SNP appears on a chromosomal band, and on which chromosome. Each chromosome is an SVG in which the XML source has been tagged with ontology information, e.g. these coordinates of the SVG are band x. The dots representing traits are similarly assigned the trait type (from the ontology) within the class attribute of the SVG. These bits of information are used to render each trait on the images, and allow for filtering on trait - and (in the very near future) on other properties such as the association p-value.
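As a rough sketch of why putting the trait type in the class attribute is handy, the snippet below dims every dot whose classes do not match the requested traits. The SVG fragment, the class names (`gwas-trait`, `breast_carcinoma`) and the `highlight` function are assumptions for illustration only; the real diagram does its filtering in the browser, and its markup is not reproduced here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical fragment: the element layout and class names are illustrative only.
SVG_SOURCE = f"""<svg xmlns="{SVG_NS}">
  <g id="chromosome-1">
    <circle class="gwas-trait breast_carcinoma" cx="10" cy="20" r="3"/>
    <circle class="gwas-trait hair_colour" cx="10" cy="40" r="3"/>
  </g>
</svg>"""


def highlight(svg_text, wanted_classes, dimmed_opacity="0.2"):
    """Dim every trait dot whose class attribute matches none of wanted_classes."""
    wanted = set(wanted_classes)
    root = ET.fromstring(svg_text)
    for dot in root.iter(f"{{{SVG_NS}}}circle"):
        classes = set(dot.get("class", "").split())
        if not classes & wanted:
            dot.set("opacity", dimmed_opacity)
    return root
```

Calling `highlight(SVG_SOURCE, {"breast_carcinoma"})` leaves the first dot untouched and sets a low opacity on the hair colour dot. Because the trait type lives in the markup itself, no lookup table between dots and traits is needed at filter time.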

DL queries are also employed as part of this visualisation, making use of the OWL describing the two ontologies (EFO and GWAS ontology). The initial creation of the diagram uses a DL query which asks, amongst other things, which bands are on which chromosomes, which SNPs are on each band and which traits should be rendered where. Traits are selected using another DL query which, as well as location information, asks if they satisfy a certain p-value (which is a datatype property) - those below this threshold are not rendered. In the future this will become more dynamic so a user can specify their own p-value threshold.

These queries all take a while, so they are all pre-computed when the diagram is built for each release, and each SVG is then cached on disk for quick retrieval. The actual filtering queries (which will also gain auto-complete in the near future) also use a simple DL query; they ask for all the subclasses of a given trait class from EFO, and these traits are passed back in a JSON object which is used to refresh the diagram, showing SVGs of this class. In the longer term, the aim is to use disk caching of the reasoner using something like Ehcache. Currently this is not possible due to some serialization issues with the version of the OWL-API that is being used, but this is set to change. This will enable much more complex queries to be performed dynamically using EFO, such as finding traits that may be the subject of a protocol of measurement or diseases that affect a particular system. There are many possibilities.
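The pre-compute-then-cache pattern and the JSON hand-off can be sketched in a few lines. This is only an illustration under assumptions: the file naming, the `render` callback and the JSON field names (`trait`, `subclasses`) are hypothetical, and the real system is built on the OWL-API rather than Python.

```python
import json
import tempfile
from pathlib import Path


def cached_svg(cache_dir, key, render):
    """Return the SVG for key, calling the slow render() only when no cached file exists."""
    path = Path(cache_dir) / f"{key}.svg"
    if not path.exists():
        path.write_text(render(), encoding="utf-8")
    return path.read_text(encoding="utf-8")


def filter_response(trait, subclasses):
    """Build the JSON payload a filter request might return (field names are hypothetical)."""
    return json.dumps({"trait": trait, "subclasses": sorted(subclasses)})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The lambda stands in for the expensive reasoner-backed rendering step.
        print(cached_svg(tmp, "chromosome-1", lambda: "<svg/>"))
        print(filter_response("cancer", {"breast carcinoma"}))
```

The second request for the same key never touches `render`, which is the whole point of doing the expensive queries once per release.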

A side-effect of this project is that the technology is reusable for other projects, and indeed we intend to utilise some of it in our current work on rendering the Gene Expression Atlas in RDF (more on that very soon). The model is relatively simple: describe your data in some schema (OWL ontology), work out what each ontology class looks like in SVG (the relationship between an ontology class and the image) and then render it.

This is all done using Semantic Web technologies: OWL, RDF, triple stores, ontologies, description logic reasoners. And you'd never know looking at the website. I've always thought this is the perfect example of when Sem Web comes good - it just works and the user never knows about the guts. This is the right way round I think; the biology should come first in this application, the technology is secondary. 99% of users won't really care that your website uses CSS or JavaScript; the same should be true here. When Semantic Web technologies hit this level of brilliant but quiet, accepted ubiquity, we have reached our goal.

*List of contributors
The tooling for this project was developed by Danielle Welter and Tony Burdett, with input from Helen Parkinson, Jackie MacArthur and Joannella Morales at the EBI and the catalogue curation team at the NHGRI.

I've been looking at the Semantic Web Journal as we've a few bits of Sem Web work we're doing (which I'll discuss in a future blog) which I think are interesting and probably worth publishing (eventually), and I've seen a few interesting articles published there. My friend and colleague Phil Lord has oft talked/battered me into submission about the Semantics of Publishing - a slightly different take than a journal about the Semantic Web - but, nevertheless, relevant.

Phil's work has concentrated on adding value to publications using things like Greycite which gives a nice way of searching webpages for metadata and presenting a way of citing, for example, blog articles. This is extremely useful in a world where a journal publication is really only part of the modern online literature we read and might like to cite.

This is what I was hoping for deep below the surface of each
webpage. Beautiful, icy RDF and maybe a penguin or two.

I had hoped the Semantic Web Journal would take things a step further by providing Semantic Web style metadata behind the pages, so I had a bit of a trawl using Greycite and cURL to see what I could get back by analysing the HTML or perhaps any RDF through content negotiation. Sadly, I got nothing back other than HTML. In fact, not even any meta keywords were in the head of the HTML, and nothing came back from cURL requests for RDF.
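For anyone wanting to repeat this sort of check, here is a minimal sketch of the two probes: asking for RDF through content negotiation, and looking for meta tags in the HTML head. It is not the exact set of commands used above; the function names and the example URL are assumptions, and nothing is sent over the network until you pass the request to something like `urllib.request.urlopen`. The first function is roughly what `curl -H "Accept: application/rdf+xml" <url>` does.

```python
from html.parser import HTMLParser
from urllib.request import Request


def rdf_request(url):
    """Build a request asking the server for RDF/XML via the Accept header (nothing is sent here)."""
    return Request(url, headers={"Accept": "application/rdf+xml"})


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of meta name -> content found in html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta
```

A page with no metadata at all, like the ones I found, simply gives back an empty dict from `extract_meta`.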

So I'm left a little disheartened; not annoyed, but disappointed. I think if you're going to lead the line by publishing articles on the Semantic Web, it would be good practice to add semantics to your Semantic Web journal, probably. I suppose, in return, the onus is then on us so-called practitioners to also follow the good example set. We've been doing this for a while for EFO, but I confess my own website also lacks RDF behind it, though I could be convinced to add it if there was a Linked Data framework that would add value to it. Ironically, this journal has some great articles on exactly these areas. The OntoBee paper talks about a service which does a lot of this stuff for ontology classes and individuals in RDF and HTML at the same time and it all works rather well, but there are multiple other ways of doing this, and, as Greycite demonstrates, they need not be complex. Just useful.

There is one other thing I should add, which I really like about this journal; an open and transparent review process, posting all peer reviews on the journal website. This is one of my biggest gripes about science as it stands. I think the review process is flawed. I think reviews should be open and transparent for journals, conferences and grants.

Thursday, 13 September 2012

Bringing Genome Wide Associations to Life

Tuesday, 4 September 2012

Semantics of the Semantic Web Journal