Thursday, 13 June 2013

Big Data - understanding what you see is more important than simply seeing it

I read an article in Nature published today titled Biology: The big challenges of big data. It was interesting in a technical computer science sort of way but, for me, it omitted what I see as the biggest challenge we face: that simply seeing Big Data is not as important as understanding it.
It is more important to understand what you see than it is to simply see it.
The red lines are called the Hindenburg Omen, a pattern used to identify
potentially large stock market falls. The early 2007 crash warning is clearly
visible in the centre and to the far right the credit crunch is about to hit.
One was also seen this month (June 2013). Image: Ian Woodward

Our group at EBI primarily works with what used to be called big data, what you might now call Medium Data™ - it's big but it's not Big Data big (though it is getting bigger). Confused? Good. One of the key tasks we undertake is to 'add value' to the data that is submitted to the likes of ArrayExpress and then processed into the Gene Expression Atlas or the BioSample Database. Adding value takes many forms, but primarily it's about making sure the data is internally consistent within the experiment and then trying to make it outwardly consistent with the rest of the experimental data we host. We use ontologies as part of this alignment, as well as resources like Ensembl, BioMart, and others. It's a Big Job (and a difficult one). For me, the Big Job we do at EBI has always been one of adding value. The EBI is not just the world's hard drive, and nor should it be.

The article describes the technical challenges of Big Data in some detail: the role of cloud computing, security, legitimacy of data sources, analysis tools, etc. But it missed what I consider to be the biggest challenge in Big Data - how you actually make sense of the massive quantities you're faced with. Larry Hunter comes closest in the article when he says "getting the most from the data requires interpreting them in light of all the relevant prior knowledge."

Data Sharing has increasingly become a misnomer to me. The point of sharing is, presumably, so others can reproduce or reuse. However, the intention of making data available to others (sharing) is somewhat redundant if the end user can't actually use it because they can't understand it. A previous Nature article reflected on the practice of data sharing, finding that reproducing results was rarely possible because of a lack of detail accompanying the data. With Big Data this problem only gets, well, Bigger.

An amusing YouTube cartoon circulated recently which perfectly captured many of the issues that I think are salient. Posting USB drives to one another will of course be impossible in the Big Data world; these are the technical issues the Nature article points to. What remains the same is the issue of understanding what it is you're trying to use: how it was produced, how it was formatted, how the variables were named, and so on.

Big Data requires Big MetaData. The scope of new technologies means we can capture much more detail about many more biological entities. The Nature Reviews Genetics article by Nekrutenko highlights that "very few current studies record exact details of their computational experiments, making it difficult for others to repeat them."

Hindenburg Omen or not, I fear that we are entering a decade of Big Disappointment if we don't address the issues of how we describe the data we are sharing in more formal, rich and meaningful ways - and do so earlier rather than when it is too late. It is already too late for much of the data that has already been 'shared'. The irony is, of course, that this existing data may already hold many of the answers we are looking for, but they will never be found simply because we can't reuse it. I'd rather have Small Data I Can Understand than Big Data I Can't. Repeating previous errors when sharing data would be the Biggest Mistake Of All.

Thursday, 16 May 2013

My NCBO Webinar

JJ Abrams was also on the call but I told him I was
too busy with ontologies no matter how much he begged.
I spoke on a live webinar last night, which has now been made available online for anyone interested. The talk is about 40 minutes long - I do drone on a bit it seems, but we had a lot to talk about. Hopefully the content is interesting. It covers the work Simon Jupp (primarily) and I have been doing as one of NCBO's Driving Biological Projects. I've blogged about some of these topics before; the main components were:
  • Our resources (briefly)
  • How we rapidly develop our ontologies
  • Our Phenotator tool, developed to allow cell biologists to develop a cellular phenotype ontology that covers their data without having to understand OWL or the various reference ontology nuances. 
  • Zooma, a knowledge base of curator ontology annotations for automatically annotating data
  • Our new RDF Atlas work, including web UI allowing new querying over the Atlas and other data resources we've integrated our data with, such as Reactome pathways
  • Our new Atlas RDF-R package (not yet public), which wraps SPARQL into nice convenience functions in R and includes an enrichment package for use against the Atlas. I'm working on a new version of this which builds on the work my student Maryam Soleimani and I did in a prototype, and will try to release it as soon as possible. I'll blog when I do.
Enjoy The Science.

Monday, 29 April 2013

Keeping it Agile: the secret to a fitter ontology in 4* easy** steps!


*there are probably more than 4
**it's not all that easy

I've been preachy recently, complaining that the ontology world doesn't apply enough software engineering practice to producing ontologies. I thought it was about time I explained what I think could be done, by describing specifically what we do here to help us. There's an expanded version of this in a paper accepted for the 2013 OWLED workshop, for those attending.


1. Whatcha gonna do?


The first thing we steal from software engineering is our overall methodology. I talked a bit about this previously at ICBO 2012, where I presented on how we applied agile software engineering methods to the development of the Software Ontology. There are a few things this gives us. It helps us prioritise development. Collecting requirements is not usually a problem - there are always bucket loads. As with most projects, there is always more work than people, and we need to focus on the things that are most important - which can change month to month.
The red stuff means we're doing it right (that is,
we're catching the stuff we're doing wrong early).

We use a few agile methods to help with this. Priority poker and buy-a-feature have been of particular use when engaging with users, and are also reasonably fun to do. It also helps keep our major stakeholders involved with the process of development, which is useful because it means there are no big surprises at the end of each sprint (i.e. cycle of development). This way everyone knows what we're gonna do, and so do we.

2. Building a house with bricks on your back


One of the primary ontologies I'm currently involved in developing is the Experimental Factor Ontology. EFO is an application ontology - that is to say, it is built to serve application use cases, as distinct from a reference ontology, which is built as a de facto reference for a given domain. When building EFO we try to reuse as many reference ontologies as we deem suitable (I won't expand on what this means here). Needless to say, our reliance on external resources introduces coupling - in the same way it does in software projects using libraries. I often refer to this as trying to build a house with the bricks strapped to your back: nice to know you have them close by, but they're heavy. We have some importing code, based on MIREOT, which we use to help us manage these imports. This still leaves us issues to look out for. For example, annotation property names vary widely - 'synonyms' being a case in point - so we need to merge these so our applications know where to find them. Where imports are not possible or suitable, we mint new EFO classes. Since multiple developers from various sites can mint these, we have built some tooling for central URI management to avoid clashes which could otherwise easily occur. URIGen is this tool - see my previous blog for more on this.
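To illustrate the kind of normalisation involved, here is a minimal Python sketch of merging synonym annotation properties onto a single canonical one. The property IRIs and the `normalise_synonyms` helper are illustrative assumptions, not our actual import code:

```python
# Sketch: map the various synonym annotation properties found in imported
# reference ontologies onto one canonical property, so that downstream
# applications only ever need to look in one place.

# Variant synonym property IRIs seen across source ontologies (illustrative)
SYNONYM_PROPERTIES = {
    "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym",
    "http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym",
    "http://example.org/alternative_term",  # hypothetical
}

# Hypothetical canonical target property
CANONICAL = "http://example.org/efo/alternative_term"

def normalise_synonyms(annotations):
    """Rewrite any recognised synonym property to the canonical one.

    `annotations` is a list of (property_iri, value) pairs on one class.
    Non-synonym annotations pass through unchanged.
    """
    return [
        (CANONICAL if prop in SYNONYM_PROPERTIES else prop, value)
        for prop, value in annotations
    ]
```

The real work is, of course, in deciding which properties really do mean 'synonym' in each source ontology; the mapping table above is the part that has to be curated by hand.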

To keep track of external changes and to produce our release notes we use our Bubastis tool, which does a simple syntactic diff across two ontologies to tell you what's changed, been added and been deleted. Keeping track of what's going on externally is a complicated process and brings baggage with it. There is a discussion to be had as to when keeping track introduces an unacceptable overhead, as you are effectively at the mercy of external developers. Examples of changes we've had to deal with include: upper ontology refactoring, mass URI refactoring, funding ending, general movement of classes, changes to design patterns (and the axiomatisation therein), and so on. For what it's worth, I think we're in a better place now than when we started building EFO five and a bit years ago, although my opinion on this will change if the new (non-backwards-compatible) BFO temporal relations are adopted.
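The core of a syntactic diff like this is easy to sketch. Bubastis itself works over the OWL API; the pure-Python mock-up below just assumes each ontology version has been reduced to a mapping from class IRI to its set of axiom strings:

```python
def diff_ontologies(old, new):
    """Class-level syntactic diff between two ontology versions.

    `old` and `new` map class IRI -> set of axiom strings for that class.
    Returns (added, deleted, changed) sets of class IRIs.
    """
    added = set(new) - set(old)                  # classes only in the new version
    deleted = set(old) - set(new)                # classes removed since the old version
    changed = {                                  # classes whose axioms differ
        iri for iri in set(old) & set(new) if old[iri] != new[iri]
    }
    return added, deleted, changed
```

For example, a class gaining a new parent shows up in `changed`, while a renamed URI shows up as one deletion plus one addition - which is exactly why URI refactoring upstream is so painful to track.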

3. Test Driven Development


Another agile process we adopt is test-driven development. In a continuous integration framework, each commit of code must be tested to ensure that it does not break previously working components or introduce new bugs, and we treat OWL with the same respect. We have developed a series of automated tests using Bamboo, run against the ontology after each commit, which check for: invalid namespaces; IRI fragments outside accepted conventions; duplicate labels between different classes; synonyms duplicated between classes; obsolete classes used in axiomatisation; and expected class subsumptions (e.g. cancer should be a subclass of disease).
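To give a flavour of what such checks look like, here are minimal Python sketches of three of them (the function names and data shapes are illustrative, not our Bamboo build code, which operates on the parsed OWL):

```python
def check_duplicate_labels(labels):
    """labels: class IRI -> label. Return labels shared by more than one class."""
    seen = {}
    for iri, label in labels.items():
        seen.setdefault(label.strip().lower(), []).append(iri)
    return {lab: iris for lab, iris in seen.items() if len(iris) > 1}

def check_namespace(iris, allowed_prefixes):
    """Return any class IRIs that fall outside the accepted namespaces."""
    return [i for i in iris if not any(i.startswith(p) for p in allowed_prefixes)]

def check_subsumption(superclasses, child, expected_ancestor):
    """Unit test for an expected subsumption (e.g. cancer under disease).

    `superclasses` maps class IRI -> set of superclass IRIs (asserted or,
    better, inferred by the reasoner)."""
    stack, seen = [child], set()
    while stack:
        c = stack.pop()
        if c == expected_ancestor:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(superclasses.get(c, ()))
    return False
```

A commit that makes any of these return a non-empty result (or `False` for an expected subsumption) fails the build, which is exactly the behaviour we want: fail early, before release.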

4. Design Patterns


Another aspect is performance and the OWL DL profile we restrict ourselves to. In order to fully exploit the querying power of the ontology, we use reasoning to infer various hierarchies of interest, such as classifications of cell lines by disease and species, and we need this to happen in a time that is responsive. There are several methods we use to ensure this remains the case. The first is the use of design patterns: we restrict axiomatisation to a set of patterns that we have developed to answer our priority competency questions. The second is to disallow the addition of new object properties and characteristics on those properties. The third is to classify the ontology on every commit (and run the test code described above). For those interested in reasoners, HermiT has given us the best performance for quite some time now.
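As an illustration of what a design pattern buys you, a template like the following generates the axioms for classifying a cell line by disease and species, rather than letting editors hand-write arbitrary axioms. The Manchester-syntax rendering and the property names here are illustrative assumptions, not EFO's actual pattern:

```python
def cell_line_pattern(cell_line, disease, species):
    """Render an (illustrative) cell line design pattern in Manchester syntax.

    Every cell line class gets the same axiom shape, which keeps the
    ontology inside the profile the reasoner handles quickly.
    """
    return (f"Class: {cell_line}\n"
            f"    SubClassOf: 'cell line'\n"
            f"    SubClassOf: derives_from some "
            f"('cell' and part_of some ({species}))\n"
            f"    SubClassOf: bearer_of some ({disease})")
```

Because every instance of the pattern uses the same small set of properties and constructs, the reasoner's workload grows predictably as classes are added - which is what keeps commit-time classification responsive.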


We also employ an automated release cycle to release a new version of EFO monthly, in order to best coordinate with our application needs. The release is programmatically performed using a Bamboo build plan which performs tasks such as creating the inferred version of the ontology, converting the ontology to
OBO format, publishing files to the web, building the EFO website and creating URLs for classes in the EFO namespace to ensure that concepts described in EFO fully dereference.

Agility, reality, profanity


Our overall approach has improved production quality immensely over the last few years. To quantify this with an example: over our last three months of work, our EFO continuous integration testing has passed on check-in 74% of the time. This means that 26% of the time it has not. Although this sounds like a bad thing, it's actually good to know we're catching these failures now, before a release goes out to applications. Much of it is relatively minor stuff, like white space in labels, which we are fairly strict on, but sometimes it's more serious stuff that we're glad we caught.

We've also become more dynamic in prioritising and sharing tickets, meaning the more important stuff gets done more quickly and by a number of people, with tickets picked off the top of the priority pile as people become available.

We still struggle with a few things and these are challenges that hit most ontology consumers I think. The biggest is balancing correctness with 'doing something'. This is a tricky brew to get right as we don't want the ontology to be wrong, but we do want to get things out and working as quickly as possible. Thinking about the metaphysical meaning of a term over a period of months does not help when you have data to annotate covering 1,000 species and 250,000 unique annotations as your target; this is the reality we face. In the same breath though, getting things very wrong doesn't provide the sort of benefits you want from using an ontology - and using an ontology adds an overhead so there should be benefits.

There is a dirty word in the ontology world that most dare not utter, but we do so here: 'compromise'. We do stuff; if it's wrong we fix it; we release early, release often and respond rapidly to required changes from users. Sound familiar?

Monday, 4 February 2013

When will bio-ontologies grow up (and how will we know)?

Robert Stevens and I have published an experiment we did on evaluating levels of activity in bio-ontologies over the last decade. It feels like a decade since we did the work, such was the journal's delay in getting it out. Here's a summary of the full paper.

At ICBO 2011 in Buffalo, NY, Robert Stevens and I were chatting outside my hotel about which ontologies we use in our work and how one makes a choice. A few others - I recall +Melanie Courtot and +Frank Gibson were also present - had thoughts too. There was a collective wisdom about ontology maturity, development and engineering in what we said, and we felt it probably would change the landscape of this area of research forever. I wish I had been able to remember any of it.

Nevertheless, Robert and I went ahead and performed a bit of work looking at one aspect of ontology evaluation to see if we could glean some insights into the constitution of the various bio-ontologies in existence and see how far we had come. We limited our work to looking at what we called 'activity'. Some of the research questions we wanted to investigate were:

  • How frequently is an ontology updated?
  • What do these changes look like?
  • Who makes these changes?
  • Is there attribution for changes?
  • Can we see patterns (profiles) as ontologies mature?

"Ontology activity" by Aureja Jupp. 
Our method for doing this was relatively simple. First, find the available ontology repositories. Second, download the ontologies and record data about each - date, committer, etc. Third, perform a syntactic diff between subsequent versions, looking at the number of classes added, deleted, and those with axiomatic changes (for example, a new parent class, or a part_of assertion made on them). Finally, perform a bit of analysis on these results.

Activity and the Super-Ontologist

We performed the diff using a tool I wrote several years ago called Bubastis - there's also a Java library available since we did this work. The tool is fairly simple; it uses the OWL API to read in OWL or OBO ontologies and performs a class-level set difference on axiom assertions. It also looks for newly declared named classes, and similarly for named classes present in previous versions but since removed.

I'm not going to go into everything here - you can read the paper for all the details - but here are a few of the interesting things we found.

1. Most activity in the community is in refining classes that already exist within an ontology. Alongside this, we also found that a lot of classes were deleted between versions, which is in contrast to the perceived wisdom that classes are only made obsolete, never removed. It is arguable that for the OBO ontologies this is less of a crime; when we look at the details we can see some of these deletions are caused by namespace changes between versions, with the ID fragment at the end (e.g. the GO_1234567 bit) not changing. Nevertheless, this is a problem if one chooses to use the OWL versions or the full URIs for these ontologies.

2. Between 2009 and 2011, ontology activity remained fairly constant. We produced a metric by totalling all the changes and compared the two years using a paired t-test, which suggests that activity levels have not changed significantly.
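For anyone wanting to reproduce this kind of comparison, the paired t statistic is simple enough to compute by hand: take the per-ontology differences between the two years, then t = mean(d) / (sd(d) / √n). A minimal sketch (the numbers in the usage below are toy values, not our data):

```python
import math

def paired_t_statistic(before, after):
    """Paired t statistic for two matched samples, e.g. per-ontology
    activity totals in 2009 (`before`) and 2011 (`after`)."""
    d = [a - b for a, b in zip(after, before)]   # per-ontology differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)                 # t with n-1 d.o.f.
```

In practice you would hand the statistic (or the raw samples) to a statistics package such as R's `t.test` with `paired=TRUE` to get the p-value.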

3. Active ontologies tend to release often. This is perhaps not surprising to anyone familiar with software practices of releasing early and often, but was good to see. The perceived wisdom in software engineering is that this allows for rapid feedback from and response to users - something active ontologies have adopted.

4. Some ontologies may be dead. Dormancy may suggest an ontology is inactive - or complete. The Xenopus anatomy ontology is currently inactive, and this is more likely to be a case of completeness rather than death. Nevertheless, monitoring when an ontology becomes moribund is almost certainly a worthwhile endeavour for efforts such as the OBO Foundry, since a dead ontology could end up occupying an area, thereby preventing progress.

5. Lots of committers does not always mean lots of activity. There are several factors to consider here. Firstly, a few of the projects use automated bots to release the ontology, so the committer name is not a good indicator of the number of editors. Nevertheless, there are some projects with many different committers and low levels of activity, and vice versa. This may be suggestive of many things - perhaps that large collaborative efforts suffer from "consensus paralysis" - but it may also suggest that tracking tacit contributions is difficult. Which raises other issues, namely of appropriate credit for such contributions. We arrived at more questions than answers with this one.

6. There lives amongst us a Super-Ontologist. The committer 'girlwithglasses' has made a total of 500 ontology revisions spanning 13 different ontology projects; she is truly the Ontological Eve.

Discussions

We had many discussions when writing this work up as to what it all meant, and we've put most of this into the paper - if you're interested you'd be best off heading there. Perhaps, more than anything, the conclusion I was drawn to from all of this work was that ontology engineering is still immature. We leaned towards software engineering when we undertook this work - maturity profiles, diff analysis on code revisions, etc. - but we were feeling our way towards what might be a good way of summarising this aspect of what you might call ontology quality. Software and ontologies are the same in many aspects and different in many others, but I still think this is where our best hope of applying QC lies.

I've previously discussed why the lack of ontology engineering practices is a problem, and I think we need more quantifiable approaches for developing and evaluating ontologies. At a workshop I attended in September, one of the top desires of the biologist user community was advice on which ontologies they should use. I'm always tempted to name the ontologies I know and use. When we first started using ontologies, we performed an analysis of coverage across the various ontologies to work out which would offer us most. Coverage is one metric - a simple but important one - but now that there are so many ontologies which offer coverage, we need more than this to inform our decision. Our ontologies are growing up. We should probably help to point them in the right direction.

Sunday, 9 December 2012

After the hype, the biology (and finally, the cancan)

I spent an enjoyable three days in Paris at this year’s SWAT4LS, meeting and talking with a lot of interesting people working in and around the semantic web community. Here are a few observations I made.

There were a lot of people.
It was my first time at this particular meeting, and it struck me that there were a lot of people in attendance - somewhere around 100. I was surprised, though perhaps I should not have been, as the community has been steadily growing over the last few years. Still, this suggests this is no longer just a niche activity, and it has many more paid-up members than ever before. OK, I accept the fact it was in Paris might have swayed a few hearts and minds, but still.
They were cancaning in the aisles at Moulin Rouge
when they heard about all the RDF in town. 

There was a lot of interesting work…
I was pleasantly surprised at how good a lot of the work presented was, and much of it spoke to what we have been doing over the past six months. To pick out one, I was impressed with a talk given by Matthew Hindle on data federation in ecotoxicology using SADI, in which he outlined the way they utilise the services to start to answer what I would call proper biological questions. It was a nice example of picking up existing approaches and resources and applying them successfully to their own data and problems, and it speaks to the notion that the field has matured somewhat over the last few years. Of course I've read this claim numerous times in the past, but largely from an anecdotal point of view; this was at least evidence that, to some extent, things exist that can be used to solve problems in the life sciences. There was inevitably some work to do - they did not find everything as an out-of-the-box solution - but the components were there at least.

I also liked the SPARQL R package that has been recently published and that we've been using in an MSc project with one of our students. It's been very useful, and we've written a package for analysing our Gene Expression RDF data in more intuitive ways, allowing simple R-like function calls to be made over the RDF, behind which the SPARQL lies hidden. I think this sort of tool is important because it exposes the technology to an important, existing bioinformatics community in ways that aid and do not hinder their work. We'll be releasing this tool early next year.

UniProt also presented their work on using rules within their RDF to detect inconsistencies within the annotations. I like this a lot, and it's something we have also started exploring with our own RDF - for example, looking for disease annotations that were made to ontology classes which were not subclasses of disease. We found a couple. This demonstrates nicely the advantage of having your data in a format that is native to the ontology: it enables one to ask meaningful ontology-type questions (subclasses of x, classes part_of y, etc.) rather than having to formulate some hack between a database query and an OWL one and then do a textual comparison.
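The rule behind that disease check can be sketched very simply. The real check runs as a query over the RDF; this pure-Python toy works on an in-memory subclass map instead, and every IRI in it is made up for illustration:

```python
# Flag 'disease' annotations whose target class is not actually a
# subclass of disease. All identifiers below are illustrative.

DISEASE = "ex:disease"

SUBCLASS_OF = {                      # direct subclass edges: child -> parents
    "ex:cancer": {"ex:disease"},
    "ex:carcinoma": {"ex:cancer"},
    "ex:cell_line": {"ex:material_entity"},
}

def is_subclass(child, ancestor, edges=SUBCLASS_OF):
    """Walk the subclass hierarchy upwards looking for `ancestor`."""
    stack, seen = [child], set()
    while stack:
        c = stack.pop()
        if c == ancestor:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(edges.get(c, ()))
    return False

def inconsistent_disease_annotations(annotations):
    """annotations: (sample, annotated_class) pairs labelled as disease.

    Returns the pairs whose class is not under disease - the suspects."""
    return [(s, c) for s, c in annotations if not is_subclass(c, DISEASE)]
```

With RDF and an ontology-aware store, the equivalent is a one-line query over `rdfs:subClassOf`; the point is that the question is expressed in the ontology's own terms rather than as a textual comparison.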

…but nothing that got me really, really excited
I didn't at any point have that epiphany that made me think "we're there". I've read much hyperbole of this sort far too often in publications (numbers of triples do not alone equate to good science), and at the moment it remains just that. I do think the field is maturing, and I think it is now becoming Important to those working in the life sciences. Certainly here at EBI we've been working a lot on RDF representations of data over the last six months, and I see this from others too. But it doesn't underpin the basic science we do. It's not that important to our databases, our curation, our applications and, most importantly, our users. It may become so - I think it probably will - but it's not there yet, so, for now, read those aforementioned publications with scepticism.

There is much more to be done; opportunity, hard work and money
I see a lot of interesting opportunities in this area but, to my mind, there need to be more engineering methods applied if we want to see fewer bespoke solutions that live and die within the time it takes a paper to be written, accepted and published (SPARQL endpoints in particular). I've said this many times about engineering ontologies, and I think it equally applies here. We need better update models - updating RDF once it's in a triple store seems to be too onerous a task at present. And documentation on how to do this stuff is really lacking. The HCLS W3C group that we have been working with has been having a go at writing a note on representing gene expression data in RDF, but it's slow work and I still don't know if we're on the right lines. Which also suggests we need better ways of evaluating this stuff. What does it all mean once it's out there, in RDF? Can I use it, integrate with it, and how? Most importantly, is it correct? It's not just about sticking a URI on everything and making it RDF - that's too simple if we want to use this in more computational approaches to analysing and solving real biological problems. One of the big problems I've found in this area is convincing funding agencies that these approaches can result in discoveries in biology when the evidence for this is conceptually strong but practically weak. As is often the case, we have a scenario in which waiting for discoveries before giving more funding will mean the discoveries never come, because the work is never funded. If I had a single want, it would be some focused calls for this work across the likes of the BBSRC and EU Framework 7, with some specific biological objectives. There is hope on this front - Douglas Kell gave a plenary talk at last year's SWAT4LS, so there is some recognition of the importance of this area.

My comrade in arms Simon Jupp gave a talk recently on the RDF work we have been doing (mainly Simon) which I added a subtitle to: ‘after the hype, the biology’. The hype may be fading, the biology may be surfacing but there is still much to do.

*Congratulations to Tomasz Adamusiak on winning an iPad in the Nanopublication competition. He celebrated Tomasz-style: an evening at the Moulin Rouge.*