Sunday, 9 December 2012

After the hype, the biology (and finally, the cancan)

I spent an enjoyable three days in Paris at this year’s SWAT4LS, meeting and talking with a lot of interesting people working in and around the semantic web community. Here are a few observations I made.

There were a lot of people.
It was my first time at this particular meeting and it struck me that there were a lot of people in attendance, somewhere around 100. I was surprised, though perhaps I should not be as the community has been steadily growing over the last few years. Still, this suggests this is no longer just a niche activity and has many more paid-up members than ever before. OK, I accept the fact it was in Paris might have swayed a few hearts and minds, but still.
They were cancaning in the aisles at Moulin Rouge
when they heard about all the RDF in town. 

There was a lot of interesting work…
I was pleasantly surprised at how good a lot of the work presented was and it spoke to much of what we have been doing over the past six months. To pick out one, I was impressed with a talk given by Matthew Hindle on data federation on Ecotoxicology using SADI in which he outlined the way they utilise the services to start to answer what I would call proper biological questions. It was a nice example of where they had been able to pick up some of the existing approaches and resources and apply them to their data and problems successfully and speaks to the notion that the field has matured some over the last few years. Of course I've read this claim numerous times in the past but largely from an anecdotal point of view; this was at least evidential that, to some extent, things exist that can be used to solve problems in the life sciences. There was inevitably some work to do of course, they did not find everything as an out of the box solution, but the components were there at least. I also liked the SPARQL R package that has been recently published and that we've been using in an MSc project with one of our students. It's been very useful  and we've written a package for analysing our Gene Expression RDF data in more intuitive ways, allowing simple R-like function calls to be made with the RDF, behind which the SPARQL lies, hidden. I think this sort of tool is important because it exposes the technology to an important, existing bioinformatic community in ways that aid and not hinder their work. We’ll be releasing this tool early next year. UniProt also presented their work on using rules within their RDF to detect inconsistencies within the annotations. I like this a lot and it's something we have also started exploring with our own RDF, such as looking for disease annotations that were made to ontology classes which were not subclasses of disease - we found a couple. This demonstrates nicely the advantage of having your data in a format that is native to the ontology. This enables one to ask meaningful ontology type questions (subclasses of x, classes part_of y, etc.) rather than having to formulate some hack between a database query and an OWL one and then do a textual comparison.

…but nothing that got me really, really excited
I didn't at any point have that epiphany that made me think "we're there". I've read much hyperbole of this sort far too often in publications (numbers of triples does not alone equate to good science), and at the moment it still remains just that. I do think it's maturing and I think this field is now becoming Important to those working in the life sciences. Certainly here, at EBI, we've been working a lot on RDF representations of data over the last six months and I see this from others too. But it doesn't underpin the basic science we do. It’s not that important to our databases, our curation, our applications and most importantly our users. It may become so - I think it probably will - but it's not there yet so, for now, read those aforementioned publications with skepticism.

There is much more to be done; opportunity, hard work and money
I see a lot of interesting opportunities in this area but, for my mind, there needs to be more engineering methods applied if we want to see less bespoke solutions occurring that live and die within the length it takes a paper to be written, accepted and published (in particular SPARQL end points). I've said this many times about engineering ontologies and I think it equally applies here. We need better update models – updating RDF once it's in a triple store seems to be too onerous a task at present. And documentation on how to do this stuff is really lacking. The HCLS W3C group that we have been working with have been having a go at writing a note on representing gene expression data in RDF but it's slow work and I still don’t know if we’re on the right lines. Which also suggests we need better ways of evaluating this stuff. What does it all mean once it's out there, in RDF? Can I use it, integrate with it and how? Most importantly, is it correct? It's not just about sticking a URI on everything and making it RDF – that's too simple if we want to use this in more computational approaches to analysing and solving real biological problems. One of the big problems I've found in this area is convincing funding agencies that these approaches can result in discoveries in biology when the evidence for this is conceptually strong but practically weak. As is often the case, we have a scenario in which waiting for discoveries before giving more funding will mean the discoveries never come, because the work is never funded. If I had a single want it would be some focused calls for this stuff across the likes of the BBSRC and EU Framework 7 with some specific biological objectives. There is hope on this front - Douglas Kell gave a plenary talk at last year's SWAT4LS so there is some recognition of the importance of this area.

My comrade in arms Simon Jupp gave a talk recently on the RDF work we have been doing (mainly Simon) which I added a subtitle to: ‘after the hype, the biology’. The hype may be fading, the biology may be surfacing but there is still much to do.

*Congratulation to Tomasz Adamusiak on winning an iPad in the Nanopublication competition. He celebrated Tomasz style; an evening at the Moulin Rouge.*

Tuesday, 30 October 2012

Future Ontology - Five year predictions past and future

I made a number of ontology predictions five years ago to this very day. Here's a review of those and a few more for the next five years.

On 30th October 2007 I gave my first public presentation on the ontology work I'd been doing since starting work at EBI in May that same year. Today, 30th October 2012, I gave a talk during which I reflected back on things that have passed over those five years and, in order to do so, ended up looking at that old set of slides from 2007. It made for interesting reading. At the end of the talk I made some predictions about what we might do and where we might end up as a community. I thought it might be nice to share those now and to make a few more public, five year predictions. If I'm still working and the world has not ended in 2017 maybe I'll do it all again.
I also predicted that leather pants would make
a comeback. I was wrong on that one, thank God.

October 2007 - My Future Ontology Predictions


We will rely on and reuse external URIs in our work, rather than minting our own, as ontologies become more populous and stabilise.

This is certainly true of the work we do at EBI. Some ontologies, such as Gene Ontology, have been stable for quite a while and a couple of others have also followed a fairly stable road to persisting ontology URIs over time (including our own EFO which follows the GO's practice). What I really wanted to see was that once a URI is minted in an ontology it persists unless there is a very good reason for it to go away. Many more bio-ontologies do this than used to and in some ways, this is a measure of maturity of the community.

We will add dereferencable URIs for our data and put metadata behind it.

Partially. This happens for most ontologies now which is a definite step in the right direction. For example, OntoBee does a lot of content negotiation for OBO ontologies which is nice. The data side still lags behind but we are looking into that internally now. I suppose overall the data part is less true than I would have wished but this is a classic chicken and egg. The promise comes later, when everyone does it, but until everyone does it, it is not immediately obvious why you should. We need to be bold.

As ontology numbers increase and overlap, mapping between them and data described with them, will become our biggest challenge.

I think this is true and I am still concerned by it. Where this is not true is that there is not a huge amount of data published using all of these ontologies as I perhaps envisaged in 2007. But I still maintain that when more is published, the mapping problem will be difficult. Having said all of that, I am also unconvinced that building one ontology for each domain by attempting to get all communities to agree to every definition of every biological concept is the answer either. I've seen ontologies grind to a halt over analysis-paralysis over the last five years and this is also not the way to go. Sacrificing one critical problem for another is not a good solution.

Agent technology will help in our mappings and in the way we discover data.
Pretty much didn't happen - in bioinformatics anyway. But I think this was because I was overly-optimistic about how much of the infrastructure would exist in this semantic web world. It is worth though that Google's agents (web crawlers) do use rich snippets tags which includes an RDFa version about products on web pages to help populate the 'shopping' search you see. So I was a bit off the mark but not completely.

That biological triples would be championed by all.
I think this has definitely been wrong up until very recently and in some ways I am guilty of being sucked in by the hype - by the promise that integrating all of our RDF data would bring. This year though, the EBI has started an RDF Frontier group to trial this work. You can see the work Simon and I have be doing at our FGPT Atlas RDF page to see how this is progressing - well so far I'd say. I was a bit premature on this prediction. Which leads me on nicely to...

October 2012 - My future ontology predictions for the next five years


The number of reference ontologies will level off (and some will disappear) and natural 'winner' will emerge.
I make a deliberate distinction here by saying 'reference' ontologies as I think the number of ontologies put together for applications will likely increase but they are unlikely to be considered as references for a domain. I think funding for building these reference ontologies will fall and some may even become moribund sadly. But a lot of the important ones will live on and continue to develop. The natural winners - ontologies that become the de facto choice for a domain will emerge. We will need to find ways of using these ontologies that does not necessitate building a whole new reference ontology.

Upper ontologies will play a less important role in the community.
Some might say 'about time' but I do think they've had a role to play and have helped with some things even if the approach of those involved has been, shall we say, less than endearing. But I think their domination in every discussion about whether an ontology is 'good' and how an ontology fits into an upper ontology will decline in favour of focusing on how we can use the ontologies to describe our data and do biology.

Use of ontologies and semantic web technologies in Bioinformatics will become ubiquitous.
This is ambitious but I think it should and will happen. I'm convinced, even from some of the early prototyping work we've been doing, there is enough data out there now that warrant applications for biologists to use.

Publishers will curate literature using ontologies and make the API to these annotations public.
I think the great work that happens already in GoPubMed should happen for more ontologies and for more publishers. It's just great. For more on this I refer you to Phil Lord who wages a one-man war for more semantics in publications (amongst other improvements in the industry).

Google will endorse the semantic web.
Or they will at least admit that it's useful. They already use semantics this with rich snippets. I'd like to see them support this area of web science and I think they, eventually, will. If they outwardly endorsed this, then who knows, certainly I think more people would use semantics when publishing their data on the web.

I'd love to hear more from those in the community who are willing to stick a stake in the ground.

Thursday, 13 September 2012

Bringing Genome Wide Associations to Life

A year in to the EBI's joint project with the NHGRI, the GWAS catalog and data visualization has become interactive with help from Semantic Web technologies.

In the fall of 2010, we started an informal collaboration with the NHGRI's Catalog of Genome Wide Association Studies (Hindorff et al, 2009) with our student, Paulo Silva. In that project, he rapidly created an ontology based on our ongoing EFO efforts to describe a lot of the traits in GWAS (see the NHGRI site for more on how the catalog is put together). Paulo deployed this using the existing ArrayExpress website infrastructure as a backbone to illustrate the benefits of using an ontology for curation and searching. Later that year, Paulo and Helen Parkinson presented this work to the NHGRI.
GWAS Catalog Diagram 2011
Figure 1. GWAS Catalog as of 2011. Much skilled work put this beautiful artefact
together. "My God, it's full of stars!"

Thankfully, they liked it and in October 2011 the EBI started a formal collaborative project with the NHGRI to improve the process of curation based on our expertise in using ontologies for annotation and in deploying them in applications. In addition, the famous and much-cited GWAS catalog diagram (see figure 1) was also to be given an overhaul in the way it was generated. The diagram illustrates GWAS traits associated with a SNP on a band on a human chromosome, and is generated manually by a skilled graphical artist every three months. If you take a look at the diagram you can quickly see that this is no easy task. New traits can appear (almost) anywhere each month and this can mean a lot of shuffling around of coloured dots to accommodate them. In addition, each trait has a unique colour which is described by a complex key (see figure 2)  - complex because there are so many traits and therefore a lot of colours required. In fact, the harder the team work to curate this data, the harder it becomes to generate the diagram. The lazy person in me thought the answer was simple - stop working so hard!
GWAS Catalog key with many different colours for the many different traits
Figure 2. The key keeps growing as more work is included. Each of
these colours is unique. My favourite is #99FFFF

But they continue to work hard regardless and so a further issue presents itself; that searching the catalogue, either by the search interface or by viewing the GWAS image, is also hampered by the growing size and lack of structure to the trait names. A small list and this is not really a problem, but as the catalog has grown thanks to the curation efforts of the NHGRI, so have the amount of traits.

Motivation complete, let's look at the progress so far. You can click here to see the new diagram generated by the team at EBI* (please see below for the list of contributors). The first thing to note is that it is now an interactive, dynamic diagram. You can zoom in and out, you can mouse over traits and find out what they represent rather than have to look at the key. The default key has also been reduced down to higher level groupings, significantly reducing the number of colours used. These groupings correspond to ontology classes in the EFO which are superclasses of other traits and are generated based on maximum coverage (i.e. the 18 classes that covered the most amount of traits). One of the advantages of using an ontology emerge here; that you can begin to aggregate data together and get a better feel for global trends.

In addition, you can also drill down to more specific traits using the filter option. Because the diagram is dynamic, you can highlight only those traits of interest to you. Try entering 'cancer' into the 'Filter' box and press enter. The ontology is being used to show those SNPs annotated to cancer or subclasses of cancer. Now try 'breast cancer' and you can see a subset of these cancer trait dots remain highlighted. What about 'breast carcinoma'? Have a go. Clear the filters (click clear filter) and then enter 'breast carcinoma' and you'll see the same results as for breast cancer. Again, this is ontology at work; the browser is using synonyms stored in the ontology to perform the same query it does for 'breast cancer'. Simple, but very useful.

The benefit of generating the diagram programmatically is perhaps most evident in the time series view (click on the tab). A selection of diagrams from over the last seven years is shown here and all were generated with just a few clicks. This is in much contrast to the manual, hand-crafted artefacts that went before which took many weeks and much skill to produce. Seems almost unfair really.

The Techy Bit


So what's going on behind the scenes? The list of over 650 highly diverse traits, are mapped to EFO. These traits include phenotypes, e.g. hair colour, treatment responses, e.g. response to antineoplastic agents, diseases and more. Compound traits such as 'Type 2 diabetes and gout' are separated. Links between relevant traits are also made to facilitate querying (e.g. partonomy). It's also worth mentioning that EFO reuses (i.e. imports from) a lot of existing ontologies such as the Gene Ontology and Human Phenotype Ontology which also facilitates future integration.

The GWAS ontology is a small ontology used within the triple store to describe a GWAS experiment structure (not including traits) and describes: links between a trait and a SNP and the p-value of that association; where that SNP appears on a chromosomal band; and on which chromosome. Each chromosome is an SVG in which, the XML source has been tagged with ontology information, e.g. these coordinates of the SVG are band x. The dots representing traits are similarly assigned the trait type (from the ontology) within the class attribute of the SVG. These bits of information are used to render each trait on the images, and allows for the filtering on trait - and (in the very near future) on other properties such as the association p-value.

DL queries are also employed as part of this visualisation, making use of the OWL describing the two ontologies (EFO and GWAS ontology). The initial creation of the diagram uses a DL query which asks, amongst other things, which bands are on which chromosomes, which SNPs are on each band and which traits should be rendered where. Traits are selected using another DL query which, as well as location information, asks if they satisfy a certain p-value (which is a datatype property) - those below this threshold are not rendered. In the future this will become more dynamic so a user can specify their own p-value threshold.

These queries all take a while so they are all pre-computed when the diagram is built each release and then each SVG is cached on disk for quick retrieval. The actual filtering queries (which will also become auto-complete in the near future) also use a simple DL query; they ask for all the subclasses of a given trait class from EFO and these traits are passed back in a JSON object which is used to refresh the diagram, showing SVGs of this class. In the longer term, the aim is to use disk caching of the reasoner using something like Ehcache. Currently this is not possible due to some serialization issues with the version of the OWL-API that is being used but this is set to change. This will enable much more complex queries to be performed, utilising EFO, dynamically such as traits that may be the subject of a protocol of measurement or diseases that affect a particular system. There are many possibilities.

A side-effect of this project is that the technology is reusable for other projects and indeed we intend to utilise some of this for some of our current work in rendering the Gene Expression Atlas in RDF (more on that very soon). The model is relatively simple; describe your data in some schema (OWL ontology), work out what each ontology looks in SVG (the relationship between an ontology class and the image) and then render it.

This is all done using Semantic Web technologies; OWL, RDF, triple stores, ontologies, description logic reasoners. And you'd never know looking at the website. I've always thought this is the perfect example of when Sem Web comes good - it just works and the user never knows about the guts. This is the right way round I think; the biology should come first in this application, the technology is secondary. 99% of users won't really care that your website uses CSS or JavaScript the same should be true here. When Semantic Web technologies hit this level of brilliant but quiet, accepted ubiquity, we have reached our goal.

*List of contributors
The tooling for this project was developed by Danielle Welter and Tony Burdett, with input from Helen Parkinson, Jackie MacArthur and Joannella Morales at the EBI and the catalogue curation team at the NHGRI.

Tuesday, 4 September 2012

Semantics of the Semantic Web Journal


I've been looking at the Semantic Web Journal as we've a few bits of Sem Web work we're doing (which I'll discuss in a future blog) which I think is interesting and probably worth publishing (eventually) and I've seen a few interesting articles published there. My friend and colleague Phil Lord has oft talked /battered me into submission about the Semantics of Publishing - a slightly different take than a journal about the Semantic Web - but, nevertheless, relevant.

Phil's work has concentrated on adding value to publications using things like Greycite which gives a nice way of searching webpages for metadata and presenting a way of citing, for example, blog articles. This is extremely useful in a world where a journal publication is really only part of the modern online literature we read and might like to cite.

This is what I was hoping for deep below the surface of each
webpage. Beautiful, icy RDF and maybe a penguin or two.
I had hoped the Semantic Web Journal would take things a step further by providing Semantic Web style metadata behind the pages so I had a bit of trawl using Greycite and cURL to see what I could get back by analysing the HTML or perhaps any RDF through content negotiation. Sadly, I got nothing back other than HTML. In fact, not even any meta keywords were in the head of the HTML, and nothing back from cURL requests for RDF.

So I'm left a little disheartened; not annoyed, but disappointed. I think if you're going to lead the line by publishing articles on the Semantic Web, it would be good practise to add semantics to your Semantic Web journal, probably. I suppose, in return the onus is then us so-called practitioners to also follow the good example set. We've been doing this for a while for EFO but I confess my own website also lacks RDF behind it but I could be convinced to do so if there was a Linked Data framework that would add value to it. Ironically, this journal has some great articles on exactly these areas. The OntoBee paper talks about a service which does a lot of this stuff for ontology class and individuals in RDF and HTML at the same time and it all works rather well, but there are multiple other ways of doing this, and, as Greycite demonstrates, they need not be complex. Just useful.

There is one other thing I should add, which I really like about this journal; an open and transparent review process, posting all peer reviews on the journal website. This is one of my biggest gripes about science as it stands. I think the review process is flawed. I think reviews should be open and transparent for journals, conferences and grants.

Thursday, 26 July 2012

Why choosing ontologies should not be like choosing Pepsi or Coke

I've just returned from the International Conference on Biomedical Ontology (ICBO) which is the biggest conference in the field of bio-ontology. I was invited to sit on a panel which was somewhat provocatively titled 'How to deal with sectarianism in biomedical ontology' during which we discussed how we might better get along and defuse some of the issues that have plagued the community over the last decade or so.

Fish-Tree-Pepsi-Coke. I know what you're thinking but
just go with me on this and read the post.
The range of views of the panelists was interesting, in part because they were not as extreme as one might have expected, or at least as they might have been five years ago. I'll attempt to summarise the panelists thoughts based on their initial two minute 'pitch' slide, I've included a link to the participants slide:

Alan Rector - Alan mentioned the need for humility and to understand what a given ontology is designed to do before we criticise it as they are can be made for different purposes. He also mentioned the need for proper evaluation.

Chris Stoeckert - Chris stated that sectarianism is inevitable and that he had chosen his sect which was BFO/realism. Ultimately, he said the biggest sect wins and that this is the OBO Foundry, which, as a community effort, we should join.

Barry Smith - Barry suggested that any ontology of any worth should be developed by an ontologist that has signed up to a 'code of ethics' which includes principles of reuse, aggressive testing in multiple real world applications and of  'thinking first' before adding a term or definition.

My own stance was that in general, I don't think a sectarian approach is very useful, not only because it causes political divides within our community, but because it also alienates us from other communities who, from the outside looking in, may be less likely to engage with us. And that hurts us because above all else we need users, more than they need us. I also think competition is fine. This is in general how science has worked for quite some time, moreover, if it didn't then we would never have made leaps forward by listening to the minority voices on issues such as evolution and Copernican heliocentrism. 

But underlying everything I said is my desire to see ontology engineering become a first class citizen and mature as a discipline. My job, in part, entails building ontologies for millions of data points with much diversity; 1,000 species, 30,000 experiments, 250,000 unique annotations. If people are willing to call out that I should be using ontology a instead of ontology b, then I need to know why, and this can not be based on subjective or political opinions. I want to see the development of formal, objective metrics to determine whether or not one ontology is better than another, so that we can really measure these artifacts and have something scientific to base our judgements on.

Alan Rector also rightly points out ontologies are built for different purposes so we need to factor that in. As Einstein said "if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid." If Amazon used an ontology to power their website, it would be hard to argue that particular fish is not a good artifact as the Amazon application seems to work pretty well.

I've also heard many comments from certain quarters about an 'ontology crisis' wherein ontologies of poor quality are now everywhere to be seen, polluting the pool. This sort of comment is similar to comments made during the software crisis of the 1960s, and, given that funding for ontologies can be hard to come by, we can ill afford to overrun. They reacted to this by developing software engineering processes and methods which, over time, helped enormously, though they did not resolve all the issues, cf. no silver bullet. Whatever your stance, it is hard to argue against wanting proper processes and methods for building in quality; nobody wants a blue screen of death on a plane's fly-by-wire system during a transatlantic flight. Similarly, nobody wants a medical system using an ontology to give incorrect results. An ontology for your photo collection, we care less so.

So what do we need? Here's my list;
  1. A formal set of engineering principles for systematic, disciplined, quantifiable approach to the design, development, operation, and maintenance of ontologies
  2. The use of test driven development, in particular using sets of (if appropriate, user collected) competency questions which an ontology guarantees to answer, with examples of those answers - think of this as similar to unit testing
  3. Cost benefit analysis for adopting frameworks such as upper ontologies, this includes aspects such as cost of training for use in development, cost to end users in understanding ontologies built using such frameworks, cost benefits measured as per metrics such as those above (e.g. answering competency questions) and risk of adoption (such as significant changes or longer term support).
In a sentence; making public judgements on ontologies should be a formal, objective and quantifiable process and less like deciding whether you'd prefer a Pespi or a Coke.

Incidentally, I prefer Coke.