Sunday, 9 December 2012

After the hype, the biology (and finally, the cancan)

I spent an enjoyable three days in Paris at this year’s SWAT4LS, meeting and talking with a lot of interesting people working in and around the semantic web community. Here are a few observations I made.

There were a lot of people.
It was my first time at this particular meeting and it struck me how many people were in attendance, somewhere around 100. I was surprised, though perhaps I should not have been, as the community has been steadily growing over the last few years. Still, this suggests this is no longer just a niche activity and has many more paid-up members than ever before. OK, I accept the fact it was in Paris might have swayed a few hearts and minds, but still.
They were cancanning in the aisles at the Moulin Rouge
when they heard about all the RDF in town. 

There was a lot of interesting work…
I was pleasantly surprised at how good a lot of the work presented was, and it spoke to much of what we have been doing over the past six months. To pick out one, I was impressed with a talk given by Matthew Hindle on data federation in ecotoxicology using SADI, in which he outlined the way they utilise the services to start to answer what I would call proper biological questions. It was a nice example of where they had been able to pick up some of the existing approaches and resources and apply them successfully to their data and problems, and it speaks to the notion that the field has matured somewhat over the last few years. Of course I've read this claim numerous times in the past, but largely from an anecdotal point of view; this was at least evidence that, to some extent, things exist that can be used to solve problems in the life sciences. There was inevitably some work to do, of course; they did not find everything as an out-of-the-box solution, but the components were there at least.
I also liked the SPARQL R package that has been recently published and that we've been using in an MSc project with one of our students. It's been very useful and we've written a package for analysing our Gene Expression RDF data in more intuitive ways, allowing simple R-like function calls to be made against the RDF, with the SPARQL hidden behind them. I think this sort of tool is important because it exposes the technology to an important, existing bioinformatics community in ways that aid rather than hinder their work. We’ll be releasing this tool early next year.
UniProt also presented their work on using rules within their RDF to detect inconsistencies within the annotations. I like this a lot and it's something we have also started exploring with our own RDF, such as looking for disease annotations that were made to ontology classes which were not subclasses of disease - we found a couple.
This demonstrates nicely the advantage of having your data in a format that is native to the ontology. This enables one to ask meaningful ontology type questions (subclasses of x, classes part_of y, etc.) rather than having to formulate some hack between a database query and an OWL one and then do a textual comparison.
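As a rough illustration of that kind of check, here is a minimal Python sketch that walks the subclass hierarchy and flags disease annotations pointing at classes that are not subclasses of disease. It is not our actual pipeline or UniProt's rules: the triples, the `ex:` identifiers and the `ex:has_disease` predicate are all made up for the example, and a plain list of tuples stands in for a triple store. Against a real SPARQL 1.1 endpoint the same question would probably be phrased with a property path along the lines of `FILTER NOT EXISTS { ?cls rdfs:subClassOf* ex:disease }`.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
# Every identifier here is invented for illustration.
TRIPLES = [
    ("ex:asthma", "rdfs:subClassOf", "ex:respiratory_disease"),
    ("ex:respiratory_disease", "rdfs:subClassOf", "ex:disease"),
    ("ex:lung", "rdfs:subClassOf", "ex:organ"),
    ("ex:sample1", "ex:has_disease", "ex:asthma"),
    ("ex:sample2", "ex:has_disease", "ex:lung"),  # annotated to a non-disease class
]


def superclasses(cls, triples):
    """Return cls plus all of its transitive superclasses."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            parents[s].add(o)
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def inconsistent_annotations(triples, predicate, root):
    """Find (subject, class) pairs where class is not root or a subclass of root."""
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == predicate and root not in superclasses(o, triples)
    )


if __name__ == "__main__":
    for subject, cls in inconsistent_annotations(TRIPLES, "ex:has_disease", "ex:disease"):
        print(f"{subject} is annotated with {cls}, which is not a kind of disease")
```

The point of the sketch is only that the question is asked directly in terms of the class hierarchy, with no string matching between a database result and an ontology lookup.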

…but nothing that got me really, really excited
I didn't at any point have that epiphany that made me think "we're there". I've read hyperbole of this sort far too often in publications (the number of triples alone does not equate to good science), and at the moment it still remains just that. I do think it's maturing and I think this field is now becoming important to those working in the life sciences. Certainly here, at EBI, we've been working a lot on RDF representations of data over the last six months and I see this from others too. But it doesn't underpin the basic science we do. It’s not that important to our databases, our curation, our applications and, most importantly, our users. It may become so - I think it probably will - but it's not there yet so, for now, read those aforementioned publications with skepticism.

There is much more to be done; opportunity, hard work and money
I see a lot of interesting opportunities in this area but, to my mind, more engineering methods need to be applied if we want to see fewer bespoke solutions (SPARQL endpoints in particular) that live and die within the time it takes a paper to be written, accepted and published. I've said this many times about engineering ontologies and I think it equally applies here. We need better update models – updating RDF once it's in a triple store seems to be too onerous a task at present. And documentation on how to do this stuff is really lacking. The HCLS W3C group that we have been working with have been having a go at writing a note on representing gene expression data in RDF but it's slow work and I still don’t know if we’re on the right lines. Which also suggests we need better ways of evaluating this stuff. What does it all mean once it's out there, in RDF? Can I use it, integrate with it and how? Most importantly, is it correct? It's not just about sticking a URI on everything and making it RDF – that's too simple if we want to use this in more computational approaches to analysing and solving real biological problems.
One of the big problems I've found in this area is convincing funding agencies that these approaches can result in discoveries in biology when the evidence for this is conceptually strong but practically weak. As is often the case, we have a scenario in which waiting for discoveries before giving more funding will mean the discoveries never come, because the work is never funded. If I had a single want it would be some focused calls for this stuff across the likes of the BBSRC and EU Framework 7 with some specific biological objectives. There is hope on this front - Douglas Kell gave a plenary talk at last year's SWAT4LS so there is some recognition of the importance of this area.

My comrade in arms Simon Jupp gave a talk recently on the RDF work we have been doing (mainly Simon) which I added a subtitle to: ‘after the hype, the biology’. The hype may be fading, the biology may be surfacing but there is still much to do.

*Congratulations to Tomasz Adamusiak on winning an iPad in the Nanopublication competition. He celebrated Tomasz style: an evening at the Moulin Rouge.*