Ok, now that I have your attention, let me start backpedalling. In the Australian e-Research sector, I’ve worked on problems of research data management (getting researchers’ data into some kind of database) and registering (recording the existence of that data somewhere else, like an institutional store, or Research Data Australia). There hasn’t been a huge amount of discussion about the mechanics of reusing data: the discussion has mostly been “help researcher A find researcher B’s data, then they’ll talk”.
Now, I think the mechanics of how that data exchange is going to take place will become more important. We’re seeing the need for data exchange between instances of MyTardis (eg, from a facility to a researcher’s home institution) and we will almost certainly see demand for exchange between heterogeneous systems (eg, MyTardis to/from XNAT). And as soon as data being published becomes the norm, consumers of that data will expect it to be available in easy to use formats, with standard protocols. They will expect to be able to easily get data from one system into another, particularly once those two systems are hosted in the cloud.
Until this week, I didn’t know of a solution to these problems. There is the Semantic Web aka RDF aka Linked Data, but the barrier to entry is high. I know: I tried and failed. Then I heard about OData from Conal Tuohy.
- An application-level data publishing and editing protocol that is intended to become “the language of data sharing”
- A standardised, conventionalised application of AtomPub, JSON, and REST.
- A technology built at Microsoft (as “Astoria”), but now released as an open standard, under the “We won’t sue you for using it” Open Specification Promise.
You have OData producers (or “services”) and consumers. A consumer could be: a desktop app that lets a user browse data from any OData site; a web application that mashes up data between several producers; another research data system that harvests its users’ data from other systems; a data aggregator that mirrors and archives data from a wide range of sites. Current examples of desktop apps include a plugin to Excel, a desktop data browser called LINQPad, a SilverLight (shudder) data explorer, high-powered business analytics tools for millionaires, a 3.5 star script to import into SQL Server…
I already publish data though!
Perhaps you already have a set of webservices with names like “ListExperiments”, “GetExperimentById”, “GetAuthorByNLAID”. You’ve documented your API, and everyone loves it. Here’s why OData might be a smarter choice, either now, or for next time:
- Data model discovery: instead of a WSDL documenting the *services*, you have a CSDL describing the *data model*. (Of course if your API provides services other than simple data retrieval or update, that’s different. Although actually OData 3.0 supports actions and functions.)
- Any OData consumer can explore all your exposed data and download it, without needing any more information. No documentation required.
What do you need to do to provide OData?
To turn an existing data management system (say, PARADISEC’s catalogue of endangered language field recordings) into an OData producer, you bolt an Atom endpoint onto the side. Much like many systems have already done for RIF-CS, but instead of using the rather boutique, library-centric standards OAI-PMH and RIF-CS to describe collections, you use AtomPub and HTTP to provide individual data items.
Specifically, you need to do this:
- Design an object model for the data that you want to expose. This could be a direct mapping of an underlying database, but it needn’t be.
- Describe that object model in CSDL (Conceptual Schema Definition Language), a Microsofty schema serialised as EDMX, looking like this example.
- Provide an HTTP endpoint which responds to GET requests. In particular:
- it provides the EDMX file when “$metadata” is requested.
- it provides an Atom file containing requested data, broken up into reasonably sized chunks. Optionally, it can provide this in JSON instead.
- optionally, it supports a standard filtering mechanism, like “http://data.stackexchange.com/stackoverflow/atom/Badges?$filter=Id eq 92046“. (Ie, return all badges with Id equal to 92046)
- For binary or large data, the items in the Atom feed should link rather than contain the raw data.
- Optionally support the full CRUD range by responding to PUT, PATCH and DELETE requests.
There are libraries for a number of technologies, but if your technology doesn’t have an OData library, it probably has libraries for components like Atom, HTTP, JSON, and so on.
As an example, Tim Dettrick (UQ) and I have built an Atom dataset provider for MyTardis – before knowing about OData. To make it OData compliant probably requires:
- modifying the feed to use (even more) standard names
- following the OData convention for URLs
- following the OData convention for filtering (eg, it already supports page size specification, but not using the OData standard naming)
- supporting the $metadata request to describe its object model
- probably adding a few more functions required by the spec.
So, why not Linked Data?
Linked Data aka RDF aka Semantic Web is another approach that has been around for many years. Both use HTTP with content negotiation, an entity-attribute-value data model (ie, equivalent to XML or JSON), stable URIs and serialise primarily in an XML format. Key differences (in my understanding):
- The ultimate goal of Linked Data is building a rich semantic web of data, so that data from a wide variety of sources can be aggregated, cross-checked, mashed up etc in interesting ways. The goal of OData is less lofty: a standardised data publishing and consuming mechanism. (Which is probably why I think OData works better in the short term)
- Linked Data suggests, but does not mandate, a triplestore as the underlying storage. OData suggests, but does not mandate, a traditional RDBMS. Pragmatically, table-row-column RDBMS are much easier to work with, and the skills are much more widely available.
- RDF/XML is hard to learn, because it has many unique concepts: ontologies, predicates, classes and so on. The W3C RDF primer shows the problem: you need to learn all this just in order to express your data. By contrast, OData is more pragmatic: you have a data model, so go and write that as Atom. In short, OData is conceptually simple, whereas RDF is conceptually rich but complex.[I'm not sure my comparison is entirely fair: most of my reading and experience of Linked Data comes from semantic web ideologues, who want you to go the whole nine yards with ontologies that align with other ontologies, proper URIs etc. It may be possible do quickly generate some bastardised RDF/XML with much less effort - at the risk of the community's wrath.]
- Linked Data is, unsurprisingly, much better at linking outside your own data source. In fact, I’m not sure how possible it is in OData. In some disciplines, such as the digital humanities cultural space, linking is pretty crucial: combining knowledge across disciplines like literature, film, history etc. In hard sciences, this is not (yet) important: much data is generated by some experiment, stored, and analysed more or less in isolation from other sources of data.
- Linked Data and OData will probably play nicely together if efforts to use schema.org ontologies are successful.
- There is room for more than one standard.
- In some spaces, OData is already gaining traction, while in others Linked Data is the way to go. This will affect which you choose. IMHO it is telling that StackExchange (a programmer-focused set of Q&A sites) publishes its data in OData: it’s a programmer-friendly technology.
It is highly implausible that Google will be beaten by Microsoft in such an important area as data sharing. And yet Google Data Protocol (aka GData) is not yet a contender: it is controlled entirely by Google, and limited in scope to accessing services provided by Google or by the Google App engine. A cursory glance suggests that it is in fact very similar to OData anyway: AtomPub, REST, JSON, batch processing (different syntax), a filterying/query syntax. I don’t see support for a data model discovery, however: there’s no $metadata.
So, pros and cons?
- Uses technologies you were probably going to use anyway.
- A reasonable (but heavily MS-skewed) ecosystem
- Potentially avoids the cost of building data exploration and query tools, letting some smart client do that for you.
- Driven by the private sector, meaning serious dollars invested in high quality tools.
- Driven by business and Microsoft, meaning those high quality tools are expensive and locked to MS technologies that we don’t use.
- Future potential unclear
- Lower barrier to entry than linked data/RDF