Building applications on large datasets
Recently there's been a lot of discussion in the technology scene about open data and ways of working with large datasets. A recent high-profile story was the UK government's data.gov.uk project, which aims to make reams of public data accessible on-line in electronic form. The opening up of this data suggests many possibilities.
In this article we look at the emerging "Data Marketplace" providers and build a small example application for querying 91Èȱ¬ Programmes data on top of Kasabi's new platform.
The data.gov.uk data is only a small part of a world of open data, some of it free-form and difficult to search and index, some of it structured and ready to be queried.
Developers who wish to build applications which analyse and present data from across a multitude of different datasets face considerable challenges. First the data must be collected, then normalised or cleansed into a usable form, and finally stored in a datastore of some kind so that it can be queried and presented. The interest and promise of open data really lies in the latter stage of this process, but so much work has to be invested in the former that many projects die an early death.
Collecting and cleansing datasets and providing scalable datastores is therefore a tempting market for a raft of upcoming start-ups and more traditional specialists. A recent post from one of these start-ups gives an excellent overview of this emerging field and the companies that are hoping to find success within it.
Working with open datasets
Chris Needham and I were looking for a way to query 91Èȱ¬ Programmes data to extract information related to a particular point in time. We were provided with early access to the Kasabi platform, which is currently in closed beta.
Kasabi is a start-up venture incubated within Talis, a company with a history of providing data services to libraries and other public institutions. We were allowed to have a look inside with the proviso that we would feed back ideas and suggestions and attempt to build a small prototype on top of the platform.
The Kasabi platform consists, behind the scenes, of the Talis datastore. This is a scalable graph database which can ingest data in RDF format and provides a SPARQL interface for querying and extracting subsets of the database. On top of this there is a web front end which allows you to explore datasets (and their licences) and see what other people are building on them. The most interesting part of the platform, though, is a middle layer which allows developers to create APIs on top of datasets by, for example, creating a stored SPARQL query but presenting the results in JSON through a simple API call. We'll come onto an example of this later.
Kasabi has already ingested many datasets into the platform, including DBPedia People, Foodista, a large set of Open Government data and, significantly for us, the 91Èȱ¬'s Programmes data.
What Was On? A simple 91Èȱ¬ Schedule query API
Chris and I set about creating a small application using the Kasabi platform. We wanted to know: "given a point in time, tell me what was on all of the 91Èȱ¬'s TV and Radio services at that point". This is currently difficult to do with the /programmes API, as it would involve fetching the schedules for each service individually, combining them, and then narrowing them down to a particular time.
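To make the pain concrete, here's a rough sketch of that naive approach in Python. The service identifiers, the JSON schedule URL format and the response shape are illustrative assumptions, not a documented /programmes contract:

import requests

# One schedule request per service, then filter client-side.
SERVICES = ["bbcone", "bbctwo", "radio4"]  # ...and many more services

def what_was_on(iso_when):
    # iso_when e.g. "2011-03-10T08:00:00Z". Lexicographic comparison of
    # ISO 8601 timestamps is valid when they share a timezone and format.
    day_path = iso_when[:10].replace("-", "/")  # e.g. "2011/03/10"
    on_air = []
    for service in SERVICES:
        url = ("http://www.bbc.co.uk/%s/programmes/schedules/%s.json"
               % (service, day_path))
        schedule = requests.get(url).json()
        for b in schedule["schedule"]["day"]["broadcasts"]:
            if b["start"] <= iso_when < b["end"]:
                on_air.append(b)
    return on_air

One HTTP round trip per service, every time, just to answer a single point-in-time question.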
We came up with a simple API which allows you to submit a date-time string and get back an array of JSON objects containing the broadcasts corresponding to that point in time. For example, a query for 8am on 10th March 2011 returns the broadcasts on air at that moment. You can add a ?limit=N query parameter to increase the number of results.
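Calling the API from code is straightforward. In this sketch the host name and the "datetime" parameter spelling are placeholders, and the field names follow the variables selected by the SPARQL query shown later:

import requests

# Host name is a placeholder for wherever the app is deployed.
resp = requests.get(
    "http://what-was-on.example.org/broadcasts",
    params={"datetime": "2011-03-10T08:00:00Z", "limit": 20},
)
# The response is an array of broadcast objects.
for broadcast in resp.json():
    print(broadcast["broadcast_start"], broadcast["service"], broadcast["title"])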
This API was created using a small app on Heroku which makes a call to a Kasabi API we created, appending our private application key.
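The app itself isn't shown here, so below is a minimal sketch of the same idea in Python with Flask. The Kasabi endpoint path and parameter names are assumptions; the pattern is the one described above, where the request is accepted, the private key is added server-side and the call is forwarded to Kasabi:

import os
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
KASABI_API = "https://api.kasabi.com/api/..."  # our stored-query API (path elided)
KASABI_KEY = os.environ["KASABI_API_KEY"]      # private key stays on the server

@app.route("/broadcasts")
def broadcasts():
    when = request.args["datetime"]   # e.g. "2011-03-10T08:00:00Z"
    date = when.split("T", 1)[0]      # separate date parameter (see below)
    params = {
        "date": date,
        "time": when,
        "limit": request.args.get("limit", "10"),
        "apikey": KASABI_KEY,         # appended server-side, never exposed
    }
    return jsonify(requests.get(KASABI_API, params=params).json())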
On the Kasabi side we created a stored SPARQL query, a detailed explanation of which is probably beyond the scope of this post. For those that understand SPARQL, here's the query in full; for those that don't, O'Reilly has a book on the subject, and the SPARQL specification itself is quite readable.
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX po: <http://purl.org/ontology/po/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?short_synopsis ?service ?broadcast ?broadcast_start ?broadcast_end
WHERE {
  ?episode po:version ?version .
  ?episode po:short_synopsis ?short_synopsis .
  ?episode dc:title ?title .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?event_time .
  ?broadcast po:broadcast_on ?service .
  ?broadcast po:schedule_date ?date .
  ?event_time tl:start ?broadcast_start .
  ?event_time tl:end ?broadcast_end .
  FILTER ( (?broadcast_start < ?time) && (?broadcast_end > ?time) )
}
In SPARQL, variables begin with a ?. Kasabi uses this to allow parameters to be provided through query strings. The actual API call our Heroku app makes looks more like this:

https://api.kasabi.com/api/
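If you're working against a plain SPARQL endpoint rather than a Kasabi stored query, you can do the same parameter binding yourself. Here's a sketch using the SPARQLWrapper Python library; the endpoint URL is whatever you have access to, and naive string replacement is only safe here because the inputs are dates and times we construct ourselves:

from SPARQLWrapper import SPARQLWrapper, JSON

def run_stored_query(endpoint, query, date, time):
    # Bind the ?date and ?time parameters by textual substitution -- the
    # same job Kasabi performs for us from the query string.
    bound = (query
             .replace("?date", '"%s"^^xsd:date' % date)
             .replace("?time", '"%s"^^xsd:dateTime' % time))
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(bound)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]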
We had to provide separate date and time parameters to make the query execute in an acceptable time. Using the FILTER operation in SPARQL can be quite slow, so Kasabi's team helped us to improve performance by first scoping the data to a specific date using the po:schedule_date relationship.
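The split itself is mechanical on the client side. A minimal sketch, assuming the stored query expects an xsd:date for the scoping parameter and a full xsd:dateTime for the comparison:

from datetime import datetime, timezone

when = datetime(2011, 3, 10, 8, 0, tzinfo=timezone.utc)
params = {
    # Scopes the match to one day's schedule via po:schedule_date, so the
    # (slow) FILTER only runs over that day's broadcasts...
    "date": when.strftime("%Y-%m-%d"),
    # ...which are then compared against the full timestamp.
    "time": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
}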
The API we've produced is an interesting proof-of-concept with applications to other areas of our work. It also demonstrates the power of having all of our data in a form that is easy to search and query. Feel free to try some sample queries of your own, but as it stands the API comes with no guarantees of accuracy or availability!
One possible extension to this API would be to support querying the schedule data of other broadcasters. However, at the moment Kasabi do not support querying across multiple datasets, and as far as I am aware schedule data from other broadcasters is not available in a form that would be easy to ingest into a datastore. I can also imagine combining time-based searches of 91Èȱ¬ data with other historical datasets, to find programmes that relate to particular events in the news, for example.
We haven't yet experimented with some of the other ways of working with the data, but many are provided, including faceted search and augmentation of RSS feeds with data from the dataset. There's also a showcase of other applications that have been built on the platform.
Conclusion
At the 91Èȱ¬ we produce a lot of data and make some of it available via our APIs. We were also the first public sector organisation to publish linked data and are now starting to use semantic data platforms to power some of our websites. But the potential for aggregated and searchable data that spans multiple domains is much larger.
Last year, an interesting challenge was issued that generated a lot of exciting ideas across the web. The enabling technology has been around for some time, but with the emergence of hosted service providers giving developers the ability to query across multiple datasets, we are starting to see the potential for realising some of these ideas.