Thursday, 1 August 2013

Thinking Small: RDFApps and Killing the Killer App

At ISMB in Berlin this year, Goncalo Abecasis gave a keynote in which he outlined the driver for all of his work: to enable more biologically significant questions to be asked of the data. I sat nodding as he summed up. I can't remember the exact wording, so forgive my paraphrasing, but the message was along the lines of: just having data is not science, just having tools is not data analysis.

The next day I spoke at ISMB on some of the work Simon Jupp and I have been doing using RDF for data integration. I made a similar point, in a much less profound way: just having RDF is not data integration, and integrating data is not a panacea (or any other type of Italian cake). I wanted to make it clear that I'm not an RDF zealot and that I don't think it solves all our data integration or sharing problems. In fact, it creates problems of its own. RDF without a clear description of the schema or ontologies used is hard to understand, and frankly even with the ontology it can still be hard. And of course there is SPARQL, which looks a bit like SQL but isn't. And should we be exposing our core users to SQL anyway, never mind SPARQL?
Think Small. Aim high. Don't get squashed.
Image courtesy of SweetCrisis /freedigitalphotos.net

Technology aside, Goncalo and I share a common goal. The work we've been doing is to enable richer, more precise, more biologically relevant questions to be asked of our Gene Expression data by exposing more of the metadata and putting it into wider contexts, as requested by users. Some of the most common queries they wish to ask involve signalling pathways, orthologs, drug targets and integration with other ontologies in the NCBO BioPortal. But with this additional richness comes the SPARQL problem.

Simon and I have spoken at length about how we see some of the RDF work we are doing, and ultimately it is not about teaching all our biology-focused users to write SPARQL. The RDF+SPARQL layer should really be seen as an application interface, one that programmers and some bioinformaticians would code against in the same way they code against other web APIs. What I'd like to see the bioinformatics community embrace is the idea of RDFApps: focused pieces of software that solve a specific use case for the biomedical community. They don't need to be big, just useful.

We've been working on a few RDFApps that I'll blog about in the coming weeks. One of them is an R package (which uses the great R SPARQL package) for querying our Atlas RDF. The App essentially hides all of the SPARQL from the user but allows some quite complex questions to be asked very simply. It also allows for some additional analysis, such as over-representation analysis. We've also been experimenting with an App that does on-the-fly faceted browsing, using Apache Solr, and visualisation of the gene expression data.
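To make the "hide the SPARQL" idea concrete, here is a minimal sketch of the pattern. It is in Python rather than R, and it is not our package: the endpoint URL, the `ex:` vocabulary and the function names are all placeholders I've made up for illustration. The only real things in it are the standard SPARQL protocol (a `query` parameter over HTTP) and the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder endpoint, not the real Atlas SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# Assumption: hypothetical vocabulary, not the actual Atlas schema.
QUERY_TEMPLATE = """
PREFIX ex: <http://example.org/schema/>
SELECT ?gene ?label WHERE {{
  ?gene ex:differentiallyExpressedIn ?condition ;
        ex:label ?label .
  ?condition ex:label "{condition}" .
}}
LIMIT {limit}
"""


def build_query(condition, limit=10):
    """Fill in the template, escaping backslashes and quotes in the literal."""
    safe = condition.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(condition=safe, limit=int(limit))


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def genes_for_condition(condition, limit=10, endpoint=ENDPOINT):
    """The only function a user sees: a question in, a list of rows out."""
    params = urllib.parse.urlencode({"query": build_query(condition, limit)})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

A user would call something like `genes_for_condition("asthma")` and never see a triple pattern. The query building and result parsing are kept as separate functions so they can be checked without touching the network.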

I've become ever more convinced over the last year or so that the idea of the Semantic Web/RDF Killer App needs to die. We are in search of something that is essentially irrelevant. It is clear that there are things going on in the community that are of much value, in and out of biology. GoodRelations is one such example: it uses lightweight semantics to attach metadata to product data on the web, which search engines like Google can use to create Google Shopping, and GoodRelations is used by over 10,000 other online stores. This is a small, focused application that has incredible power. We need to think small.

For us, thinking small means tools aimed at specific types of new analysis: specific, focused applications for the parts of biology that are of interest to a user and to a specific scientific question. RDF is a tool in this context, one that is clearly appropriate because it is a natural fit for publishing data on the web, for integrating it with other data and for describing it with ontologies. Forget Big Data, think Small Apps.

The Gene Expression Atlas RDF can be found at http://www.ebi.ac.uk/fgpt/atlasrdf