Sports Refresh: Dynamic Semantic Publishing
Hi, I'm Jem Rayfield, and I work as Lead Technical Architect for the News and Knowledge Core Engineering department.
This blog post describes the technology strategy the BBC Future Media department is using to evolve from a relational content model and static publishing framework towards a fully dynamic semantic publishing (DSP) architecture. The DSP architectural approach underpins the recently re-launched and refreshed BBC Sports site and indeed the BBC's Olympics 2012 online content.
DSP uses linked data technology to automate the aggregation, publishing and re-purposing of interrelated content objects according to a domain-modelled information architecture, providing a greatly improved user experience and high levels of user engagement.
The DSP architecture curates and publishes HTML and aggregations based on embedded Linked Data identifiers, ontologies and associated inference.
(RDF - the Resource Description Framework - is based upon the idea of making statements about concepts/resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, to represent the notion "Frank Lampard plays for England" in RDF as a triple, the subject is "Frank Lampard", the predicate is "plays for" and the object is "England Squad".)
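As a rough illustration of the triple idea, here is a minimal Python sketch. It is not how a real triple store works internally; the `Triple` type and `match` helper are names I have made up for this example, and plain strings stand in for the URIs a real RDF graph would use.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One subject-predicate-object statement (strings stand in for URIs)."""
    subject: str
    predicate: str
    obj: str

triples = [
    Triple("Frank Lampard", "plays for", "England Squad"),
    Triple("England Squad", "competes in", "Group C"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# Who plays for the England Squad?
players = [t.subject for t in match(triples, predicate="plays for", obj="England Squad")]
```

Pattern matching with wildcards like this is, loosely, what a SPARQL query does at a much larger scale.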
RDF semantics improve navigation, content re-use, re-purposing, search engine rankings, journalist determined levels of automation ("edited by exception") and will in future support semantic advertisement placement for audiences outside of the UK. The DSP approach facilitates multi-dimensional entry points and a richer navigation.
BBC News, BBC Sport and a large number of other web sites across the BBC are authored and published using an in-house bespoke content management/production system ("CPS") with an associated static publishing delivery chain. Journalists are able to author stories, manage indices and edit audio/video assets in the CPS and then publish them pre-baked as static assets to the BBC's Apache web server farm. In addition, journalists can edit and manage content in the CPS for distribution to the BBC Mobile and Interactive TV services, and IPConnected TV services. The CPS has been constantly evolving since it was developed to publish the BBC News website, which launched in November 1997, and the latest version (v6) underpins the summer 2010 redesign of the BBC News site that won
The first significant move away from the CPS static publishing model by the BBC's Future Media department was through the creation of the BBC World Cup 2010 website.
From first using the site, the most striking changes are the horizontal navigation and the larger format high-quality video. As you navigate through the site it becomes apparent that the rich ontological domain model provides a far deeper way of exposing BBC content than can be achieved through a traditional content management system with its associated relational model and static publishing solution.
As I wrote here at the time, the BBC World Cup 2010 site featured 700-plus team, group and player pages, which are powered by our high-performance DSP architecture.
Previously, BBC Sport would never have considered creating this number of indices in the CPS, as each index would need an editor to keep it up to date with the latest stories, even where automation rules had been set up. To put this scale of task into perspective, the World Cup site had more index pages than the rest of the BBC Sport site in its entirety.
The DSP architectural approach enables the BBC to support greater breadth and scale, which was previously impossible using a static CMS and associated static publishing chain. DSP allows the BBC to support and underpin the scale and ambition of the recently refreshed BBC Sports site and indeed the Olympics 2012 pages.
The entire football section of the refreshed sports site is orchestrated by automated annotation-powered aggregations. The DSP architecture automatically authors a page for every football team and football competition within the UK in addition to a page for every Olympic athlete (10000+), team (200+), discipline (400-500) and dozens of venue pages.
The number of automated pages managed by the DSP architecture is now well in excess of ten thousand. This number of pages is simply impossible to manage using a static CMS driven publishing stack.
Since the World Cup the DSP architecture has been augmented with a Big Data scale content store (MarkLogic) for managing rapidly changing statistics, navigation and in the future all content objects, thus evolving the architecture completely away from its static publishing roots.
DSP enables the publication of automated metadata and content state driven web pages that require minimal journalist management, as they automatically aggregate and render links to relevant stories and assets.
(Metadata is data about data. In this instance, it provides information about the content of a digital asset. For example, a World Cup story might include metadata that describes which football players are mentioned within the text of the story. The metadata may also describe the team, group, or organization associated with the story.)
The published metadata describes the BBC Sport content at a fairly low level of granularity, enabling rich content relationships and semantic navigation. Querying the published metadata enables the creation of dynamic page aggregations such as Football Team pages or Athlete pages. Published sports stats and navigation are mapped to the ontology, which allows dynamic publication of statistics and navigation against automated indices.
The BBC is evolving its publishing architecture towards a model which will allow all content objects and aggregation to be served and rendered on a dynamic request-by-request basis to support rich navigation, state changes such as event or time and, potentially, personalisation; with the information architecture and page layout reacting to underlying semantics and meta model.
The remainder of this post will describe how the BBC intends to evolve the static publishing CPS and the semantic annotation and dynamic metadata publication used for the BBC Sport site towards its eventual goal of a fully dynamic semantic publishing architecture.
Static publishing and CPS content management
The CPS has been designed and developed in-house, and so its workflow and process model has evolved to its current form (v6) through continuous iteration and feedback from the BBC journalists who use it. They author and publish content for the product development teams to build the BBC News and Sport websites. When looking at the requirements for the recently redesigned and refreshed News site, the FM department considered evaluating proprietary and open-source solutions in the CMS market for shiny new features.
However the wonderful and interesting thing about the CPS is that most BBC journalists who use it value it very highly. Compared to my experience with many organisations and their content management systems it does a pretty decent job.
The CPS client is built using Microsoft .Net 3.5 and takes full advantage of Windows Presentation Foundation (WPF). The following screen shots of the CPS user interface illustrate some of its features.
Fig 1a: Screen shot of the CPS story-editing window
Fig 1b: BBC CPS, showing the index editor
Figure 1a depicts a screen shot of the story-editing window. The CPS has a number of tools supporting its story editing functions such as managing site navigation, associating stories to indices and others such as search.
As you can see there is a component-based structure to the story content - figure 1a shows a video, an introduction and a quote.
These components are pre-defined, allowing a journalist to drag and drop as desired. It is clear that the UI is not a WYSIWYG editor. The current incarnation of the CPS focuses on content structure rather than presentation or content metadata.
Although the editor is not WYSIWYG, CPS content is available for preview and indeed publication to a number of audience-facing outputs and associated devices. On publication, CPS assets are statically rendered for audience-facing output - flavours include RSS, Atom, High-Web XHTML, JSON, Low-Web XHTML and mobile outputs.
Fig 2: BBC News CPS static publishing
The static CPS delivery architecture (depicted in Fig 2 above) provides a highly scalable and high performance static content object-publishing framework.
The CPS UI utilises a data layer API abstraction which proxies the underlying persistence mechanism (an Oracle relational database). The abstracted relational data model captures and persists stories and media assets as well as site structure and associated page layout.
The CPS UI allows the journalist to author stories, media and site structure for preview, eventual publication and re-publication.
A daemon process, the CPS publisher, subscribes to publication events for processing and delivery.
The CPS publisher contextualises content objects in order that they are appropriate for required audience/platform output. Filtered, contextualised assets are rendered by the CPS publisher as a static file per output type. The CPS publisher uses a Model-View-Controller (MVC) architectural pattern to separate the presentation logic from the business logic.
Each output representation is made persistent onto a storage area network (SAN). The BBC's home-grown content delivery chain subscribes to SAN changes and publishes each output from a secure content creation network onto a set of head servers accessible to the audience.
Although the CPS relational content model and static publishing mechanism scales and performs well it has a number of functional limitations. CPS authored content has a fixed association to manually administered indices and outputs are fixed in time without any consideration to asset semantics, state changes or semantic metadata. Re-using and re-purposing CPS authored content to react to different scenarios is very difficult due to the static nature of its output representations. Re-purposing content within a semantic context driven by metadata is impossible without manual Journalist management and re-publishing. Manual complex data management inevitably leads to time, expense and data administration headaches.
The CPS relational data model currently has a very simple metadata model capturing basic items such as author, publish date and site section. Extending the CPS relational content model to support a rich metadata model becomes complex. When designing a knowledge domain annotation schema using a relational approach, one can start by trying to create a flat controlled vocabulary, which can be associated to content objects. However, this quickly breaks, as semantics are very unclear. Evolving this further, a flat controlled vocabulary can be grouped into vocabulary categories; nevertheless, a restrictive and hierarchical taxonomical annotation schema soon evolves. As concepts need to be shared this gives rise to vocabulary repetition and ambiguity. A taxonomic hierarchy further evolves into a graph, allowing concepts to be shared and re-used to ensure that semantics are unambiguous and knowledge is concise.
Implementing a categorised controlled vocabulary within a relational database introduces complexity; creating a hierarchy introduces further complexity, and implementing a graph within a relational model takes things past the usable limits of that model. If you then add in requirements for reasoning based on metadata semantics then relational databases, associated SQL and schemas are no longer applicable solutions and are simply redundant in this problem space.
Dynamic Semantic Annotation Driven Publishing
The primary goals of the BBC World Cup 2010 web site were to promote the quality of the original, authored in-house BBC content in context and to increase its visibility and longevity by improving the breadth and depth of navigational functionality.
Increasing user journeys through the range of content while keeping the audience engaged for longer browser session durations meant that a larger more complex information architecture was required than that traditionally managed by BBC journalists.
Creating a website navigation for 700+ Player, Team, Group and Match pages posed a problem as the traditional CPS manual content administration processes would not scale. An automated solution was required in order that a small number of journalists could author and surface the content with as light a touch as possible; and automatically aggregate content onto the 700+ pages based on the concepts and semantics contained within the body of the story documents.
Fig 3: Dynamic RDF automated
The information architecture gave rise to a domain model which included concepts and relationships such as time and location; events and competitions; groups, leagues and divisions; stages and rounds; matches; teams, squads and players; players within squads, teams playing in groups, groups within stages, etc.
Clearly, the sport domain soon gives rise to a fairly complex metadata model. When you then include a model that describes the assets that need to be aggregated with a semantic association to the sport domain, it is quickly apparent that using a relational database is not an appropriate solution. The BBC needed to evolve beyond a relational CPS static architecture.
The DSP architecture and its underlying publishing framework do not author content directly; rather they publish data about the content - metadata. For the World Cup, the published metadata described the content at a fairly low level of granularity, providing rich content relationships and semantic navigation. By querying this published metadata we were able to create automatic dynamic page aggregations for Teams, Groups and Players.
The foundation of these dynamic aggregations was a rich ontological domain model. The ontology described entity existence, groups and relationships between the things/concepts that describe the World Cup. For example, "Frank Lampard" was part of the "England Squad" and the "England Squad" competed in "Group C" of the "FIFA World Cup 2010".
The ontology model also described journalist-authored assets - stories, blogs, profiles, images, video and statistics - and enabled them to be associated to concepts within the domain model. Thus a story with an "England Squad" concept relationship provides the basis for a dynamic query aggregation for the England Squad page "All stories tagged with England Squad" (Figure 3). The required domain ontology was broken down into three basic areas: asset, tag and domain ontologies (Figure 4). Together these form a triple, thus allowing a journalist to apply a triple-set to a static asset, such as associating the concept "Frank Lampard" with a story "Goal re-ignites technology row".
The tagging ontology was kept deliberately simple in order to protect the journalist from the complexities of the underlying domain model. A simple set of asset/domain joining predicates, such as "about" and "mentions", drive the annotation tool UI and workflow, keeping the annotation simple and efficient, without losing any of the power of the associated knowledge model.
Fig 4: The Asset (left), Tag (middle) and Domain (right) Ontologies used in the World Cup 2010, simplified for brevity
In addition to a manual selective tagging process, Journalist-authored content is automatically analysed against the domain ontology. A natural language determiner process automatically extracts concepts embedded within a textual representation of a story. The concepts are moderated and, again, selectively applied before publication. Moderated, automated concept analysis improves the depth, breadth and quality of metadata publishing.
The following screen shots describe the process of content annotation.
Fig 5a: A journalist, using the Graffiti tool, applies the sport concept "Gareth Barry" to a story about the footballer
Fig 5b: Annotating a story with the location Milton Keynes in the Graffiti tool
The journalist applies suggested annotations as well as searching for triplestore-indexed concepts.
As you can see all ontology concepts are linked to Linked Open Data (LOD) identifiers (DBPedia, Geonames etc.). ("Linked open data" describes a method of exposing, sharing, and connecting data via URIs on the Web.) This allows a journalist to correctly disambiguate concepts such as football players or geographical locations.
Journalist-published metadata is captured and made persistent for querying using the resource description framework (RDF) metadata representation and triple store (BigOWLIM) technology.
Fig 6: Semantic World Cup 2010 publishing, powered by a triplestore
Figure 6 depicts the dynamic semantic architecture built to publish metadata driven static asset aggregations. A triplestore (RDF metadata database) and SPARQL (RDF query language) approach was chosen over and above traditional relational database technologies due to the requirements for interpretation of metadata with respect to an ontological domain model.
The high-level goal is that the domain ontology allows for intelligent mapping of journalist assets to concepts and queries.
The chosen triple-store provides reasoning, and thus implicitly inferred statements are automatically derived from the explicitly applied journalist metadata concepts.
For example, if a journalist selects and applies the single concept "Frank Lampard", then the framework infers and applies concepts such as "England Squad", "Group C" and "FIFA World Cup 2010" (as generated triples within the triple store). Thus the semantics of the ontologies, the factual data, and the content metadata are taken into account during query evaluation. The triple-store was configured so that it performed reasoning with the semantics of all this data in real time, with hundreds of updates per minute while millions of concurrent requests occur against the same database.
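To make the inference step concrete, here is a small Python sketch of the idea. It is an illustration only: the `PART_OF` table and both function names are assumptions of mine, the real reasoning is done by the triple store against the ontology, and I have collapsed the different domain relationships into a single "part of" link for brevity.

```python
# Hypothetical, simplified domain data: each concept points at the
# concept it belongs to. The real ontology uses richer relationships.
PART_OF = {
    "Frank Lampard": "England Squad",
    "England Squad": "Group C",
    "Group C": "FIFA World Cup 2010",
}

def infer_concepts(concept):
    """Follow the part-of chain upwards and return every implied concept."""
    inferred = []
    seen = {concept}
    current = concept
    while current in PART_OF:
        current = PART_OF[current]
        if current in seen:  # guard against accidental cycles in the data
            break
        seen.add(current)
        inferred.append(current)
    return inferred

def pages_for(tagged_concepts):
    """All concept pages an asset should appear on, explicit tags first."""
    pages = []
    for concept in tagged_concepts:
        for page in [concept] + infer_concepts(concept):
            if page not in pages:
                pages.append(page)
    return pages

# A journalist applies one tag; the story lands on four pages.
story_pages = pages_for(["Frank Lampard"])
```

The point of the sketch is that the journalist only ever supplies the first tag; everything else is derived.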
This inference capability makes both the journalist tagging and the triplestore powered SPARQL queries simpler and indeed quicker than a traditional SQL approach. Dynamic aggregations based on inferred statements increase the quality and breadth of content across the site. The RDF triple approach also facilitates agile modelling, whereas traditional relational schema modelling is less flexible and also increases query complexity.
The BBC triple store is deployed multi-data centre in a resilient, clustered, performant and horizontally scalable fashion, allowing future expansion for additional domain ontologies and if required, linked open data sets.
The triple store is abstracted via a REST API built on a Java web services framework.
The REST API is accessible via HTTPS with an appropriate certificate.
The API is designed as a generic façade onto the triple-store allowing RDF data to be re-purposed and re-used pan BBC. This service orchestrates SPARQL queries and ensures that results are dynamically cached with a low, one minute 'time-to-live' (TTL) expiry cross data centre, using memcached.
All RDF metadata transactions sent to the API for CRUD operations are validated against associated ontologies before any persistence operations are invoked. This validation process ensures that RDF conforms to underlying ontologies and ensures data consistency. The API also performs content transformations between the various flavours of RDF such as N3 or XML RDF.
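A minimal sketch of what validating before persisting could look like is below. Everything in it is an assumption for illustration: the `ONTOLOGY` and `TYPES` tables, the identifiers, and the `validate` function are hypothetical, and the production API validates real RDF against the published ontologies rather than Python dictionaries.

```python
# Hypothetical ontology fragment:
# predicate -> (expected subject type, expected object type).
ONTOLOGY = {
    "about": ("Asset", "Concept"),
    "mentions": ("Asset", "Concept"),
}

# Hypothetical instance data: the type of each known resource.
TYPES = {
    "story:goal-re-ignites-technology-row": "Asset",
    "Frank Lampard": "Concept",
}

def validate(triple):
    """Return a list of problems; an empty list means the triple may be stored."""
    subject, predicate, obj = triple
    if predicate not in ONTOLOGY:
        return [f"unknown predicate: {predicate}"]
    expected_subject, expected_object = ONTOLOGY[predicate]
    errors = []
    if TYPES.get(subject) != expected_subject:
        errors.append(f"subject is not a known {expected_subject}: {subject}")
    if TYPES.get(obj) != expected_object:
        errors.append(f"object is not a known {expected_object}: {obj}")
    return errors

ok = validate(("story:goal-re-ignites-technology-row", "about", "Frank Lampard"))
```

Rejecting a write when `validate` returns anything keeps malformed annotations out of the store, which is the consistency guarantee described above.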
Automated XML sports stats feeds from various sources are delivered and processed by the 91Èȱ¬. These feeds are now also transformed into an RDF representation. The transformation process maps feed-supplier IDs onto corresponding ontology concepts, and thus aligns external provider data with the RDF ontology representation within the triple store. Sports stats for Matches, Teams and Players are aggregated inline and served dynamically from the persistent triple store.
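The identifier mapping can be pictured with a short Python sketch. The supplier names, IDs, the `goalsScored` predicate and the row format are all invented for this example, and the rows are assumed to have been parsed out of the supplier XML already.

```python
# Hypothetical lookup: (supplier, supplier's own ID) -> ontology concept.
SUPPLIER_TO_CONCEPT = {
    ("supplier-a", "p-1001"): "Frank Lampard",
    ("supplier-a", "p-1002"): "Gareth Barry",
}

def feed_to_triples(supplier, rows):
    """Turn parsed feed rows into triples keyed on ontology concepts.

    Rows whose supplier ID has no mapping are reported, not guessed at.
    """
    triples = []
    unmapped = []
    for row in rows:
        concept = SUPPLIER_TO_CONCEPT.get((supplier, row["player_id"]))
        if concept is None:
            unmapped.append(row["player_id"])
            continue
        triples.append((concept, "goalsScored", row["goals"]))
    return triples, unmapped

rows = [
    {"player_id": "p-1001", "goals": 2},
    {"player_id": "p-9999", "goals": 1},
]
triples, unmapped = feed_to_triples("supplier-a", rows)
```

Keeping the unmapped IDs separate means a missing mapping shows up as a data task rather than as a wrong statistic on a player page.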
Page Rendering
The dynamic aggregation and publishing page-rendering layer is built using a PHP and memcached stack.
The PHP layer requests an RDF representation of a particular concept or concepts from the REST service layer based on the audience's URL request. So if an "England Squad" page request is received by the PHP code several RDF queries will be invoked over HTTPS to the REST service layer below.
The render layer will then dynamically aggregate several asset types (stories, blogs, feeds, images, profiles and statistics) for a particular concept such as "England Squad". The resultant view and RDF is cached with a low TTL (one minute) at the render layer for subsequent requests from the audience. The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page.
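Content negotiation of this kind can be sketched in a few lines. The render layer itself is PHP; the example below is in Python purely for illustration, the `negotiate` function is my own, and it handles only exact media types and `*/*` with optional `q` values rather than the full HTTP rules.

```python
def negotiate(accept_header, supported=("text/html", "application/rdf+xml")):
    """Pick the supported media type the client prefers most.

    Simplified: understands exact types, "*/*" and q-values only.
    Falls back to the first supported type when nothing matches.
    """
    preferences = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip()
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((media, quality))

    best, best_quality = None, 0.0
    for media_type in supported:
        for media, quality in preferences:
            if media in (media_type, "*/*") and quality > best_quality:
                best, best_quality = media_type, quality
    return best or supported[0]

chosen = negotiate("application/rdf+xml;q=0.9, text/html;q=0.5")
```

A browser asking for `text/html` gets the page; a client asking for `application/rdf+xml` gets the underlying data from the same URL.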
The World Cup site made use of existing infrastructure, utilising the significant amount of existing static news kit (Apache servers, HTTP load balancers and gateway architecture). All HTTP responses are annotated with appropriate low (one minute) cache expiry headers. This HTTP caching increases the scalability of the platform and also allows caching if demand requires.
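For readers who have not set these headers by hand, a one-minute expiry might be expressed as follows. This is a sketch, not our actual configuration: the `cache_headers` helper is hypothetical and the exact header set we send is not described in this post.

```python
import time
from email.utils import formatdate

def cache_headers(now, ttl_seconds=60):
    """Build HTTP caching headers that expire ttl_seconds after `now`.

    `now` is a Unix timestamp; passing it in keeps the function testable.
    """
    return {
        "Cache-Control": f"max-age={ttl_seconds}",
        "Expires": formatdate(now + ttl_seconds, usegmt=True),
    }

headers = cache_headers(time.time())
```

Any HTTP cache sitting between the render layer and the audience can then serve the same response for up to a minute without touching the backend.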
The DSP architecture served millions of page requests a day throughout the World Cup with continually changing semantic RDF data. It served an average of a million SPARQL queries per day for the duration of the tournament, with a peak RDF transaction rate of hundreds of player statistics per minute. Cache expiry at all layers within the framework is one minute, enabling a dynamic, rapidly changing domain and statistic-driven user experience.
Sport Refresh and Olympics Dynamic Publishing
The refreshed BBC Sports site is currently served to the audience using a combination of the two architectural approaches previously described: static publishing and DSP. The parts of the Sports site which are published using DSP or static publication are visible to the audience - the flavours of URL show which system publishes the page.
The refreshed BBC Sports site mashes static and dynamic published assets onto statically published pages via a server side include mechanism. This enables the BBC to migrate a proportion of its content onto the DSP architecture in a gradual phased manner. The end goal is that the static publication chain can be retired.
Assets which are published via the static publication chain are exposed to the audience via URLs which are prefixed with https://www.bbc.co.uk/sport/0/. For example:
- Sport Home Page: /sport/0/
- Football Index: /sport/0/football/
- Golf Index: /sport/0/golf/
- Football Story: /sport/0/football/17088995
Fig 7: The statically published BBC Sport Home page (including dynamic navigation and dynamic sport statistics)
The CPS powered static publishing mechanism is currently used to curate, author, manage and publish BBC sports stories and editorially curated indices such as the main sports index and football index.
These assets are hand crafted, content managed, orchestrated and published by journalists.
When these Sports site pages are statically published they include and combine references to dynamic content. These references, known as server side includes (SSI), are resolved at render time at the Apache web server farm. (SSIs are part of a simple interpreted scripting language which allows content from one or more sources to be combined into a static web page.)
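To show the mechanics, here is a toy resolver for the standard Apache `<!--#include virtual="..." -->` directive, written in Python. It is a sketch only: in production Apache does this work, and the fragment paths and the `fetch` callback are invented for the example.

```python
import re

# Matches the Apache SSI directive <!--#include virtual="/some/path" -->
INCLUDE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')

def resolve_includes(page, fetch):
    """Replace each include directive with whatever fetch(path) returns."""
    return INCLUDE.sub(lambda m: fetch(m.group(1)), page)

# Hypothetical dynamic fragments, standing in for the DSP platform.
fragments = {
    "/dynamic/navigation": "<nav>Football | Golf</nav>",
    "/dynamic/stats": "<table><tr><td>1-0</td></tr></table>",
}

static_page = (
    '<!--#include virtual="/dynamic/navigation" -->'
    "<p>Statically published story text.</p>"
    '<!--#include virtual="/dynamic/stats" -->'
)

rendered = resolve_includes(static_page, fragments.__getitem__)
```

The story body never changes after publication, yet the navigation and statistics around it are as fresh as the last fragment fetch.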
The mainly static pages then combine dynamic content such as statistics and navigation into a single page output for consumption by the audience. A static story combined with dynamic navigation and dynamic statistics would be a good example of this mixed publication chain approach. The cacheable proxied SSI mechanism mashes together the content from the static platform and dynamic platform, allowing a phased migration towards a fully dynamic BBC sports site.
Automated annotation driven aggregation pages such as Football Team, Olympic Athlete, Olympics Discipline, and Olympics Venue are powered using the DSP approach. These pages are fully automated, requiring no journalist content management overhead. These pages do not contain any static content; they are fully dynamic and contain only references to static content objects such as stories or videos.
Journalists annotate BBC content objects such as a sports story or a video with concepts such as an athlete or a football team. Content objects are then automatically aggregated onto pages published using the newer DSP stack. For example:
- Chelsea Football Club: All the content objects associated to the concept "Chelsea"
- Tom Daley: All the content objects associated to the concept "Tom Daley"
/sport/olympics/2012/athletes/02025fcb-457d-4a77-8424-f5b8fe49b87f
- Team GB: All the content objects associated to the concept "Team GB"
/sport/olympics/2012/countries/great-britain
Fig 8: The Chelsea FC team dynamic BBC Sport page including automated metadata aggregations, dynamic sports stats and dynamic Sport navigation.
The navigation and sports statistics contained on this page are rendered on a request-by-request basis from the underlying XML content Store (MarkLogic).
The story, video, comment and analysis assets contained on this page are rendered on a request-by-request basis from the underlying RDF store (BigOWLIM).
The Sport ontology and meta model which powers these automated annotation powered aggregations has now been published and can be re-used.
Fig 9: The BBC Sport ontology as applied to Olympics 2012 Track Cycling
As you can see the model defines a simple yet generic sport ontology, which is capable of modelling sports from Football to the Men's Cycle Sprint within the Olympics 2012.
All the DSP powered pages on the sport site use this ontology model as their foundation. A simple asset model describing assets such as stories and videos linked to the Sport domain representation allows very rich dynamic content object aggregation.
The DSP's Natural Language processing and concept suggestion tool, which powers the Graffiti annotation tool, is now ontology aware. When additional concepts are added into the triple store (for example a new athlete) these concepts are immediately suggested to the Journalist as concepts for annotation. This feedback loop ensures that changes in the ontology instance data are reflected in all components of the DSP architecture.
Fig 10: Ontology aware natural language processing and annotation suggestion
The refreshed BBC Sport site's horizontal navigation is powered by a content model, which links ontology concepts to navigation entries.
This allows navigating to and automatically aggregating content from navigation linked to metadata concepts.
The underlying navigation data and associated content model are stored within a new addition to the DSP architecture - a highly scaled, high performance, fault tolerant Big Data store, namely MarkLogic.
Sports statistics provided by third party suppliers are also now stored as XML content within this query-able Content Store. The BBC sports site queries these XML fragments, adds value and re-formats the statistics into a form consumable on the sports site.
The Content Store which currently powers all of the statistics and navigation on the sports site has been scaled to handle ingesting many thousands of content objects per second whilst concurrently supporting many millions of dynamic page renditions and impressions a day. This high performance content store will allow the BBC Sports site to ingest and render sport statistics including live football scores, live football tables, live Olympics event statistics and results in near real-time whilst rendering this content dynamically using the DSP approach.
The refreshed sport site makes use of this new addition to DSP architecture for pages and content such as Live Scores: Football and Live Premier League Tables, Results, and Fixtures.
The DSP's triple store will be used in a purer sense and will now only be concerned with domain and asset metadata - it will not persist or manage content object data.
This clear separation of concerns makes the DSP persistence mechanism scalable.
Metadata is stored within a persistent RDF store suitable for modelling rich graphs. Content objects are stored within a document store suitable for live ingest and rendering.
A clean domain model, which only contains references to unique content objects, allows the content model to evolve and also allows the content to be stored in a de-coupled fashion. As long as the content has a unique identifier which is addressable, the asset->tag->domain RDF model allows the triple store to model extendable real world concepts and lets the content store model raw referenced assets.
The Sport RDF currently maps third party statistic identifiers from the sport ontology concepts into sport content objects. This allows querying across the triple-store and content store for sports statistics related to a sport concept e.g. "The league table for the English Premiership".
Fig 11: Dynamic Content Store powered sports statistics
Content objects and sports statistics can then be cut up and arranged on a personalised, metadata driven, request-by-request basis.
The Olympics 2012 sports statistics are to be ingested and delivered to the audience using the same content store and dynamic render architecture. Statistics will be supplied from every Olympics event and venue for every event within the Olympics. These statistics will be ingested in near-real time for inclusion on metadata driven pages and video feeds. This gives the BBC's online Olympics output a very real sense of live.
The triple-store and content store are abstracted and orchestrated by a REST API. The API will continue to support SPARQL and RDF validation but it will now support XQuery and XML persistence across both the triple-store and the content store.
This allows a content aggregation to be generated using a combination of SPARQL for domain querying and XQuery for asset selection. All content object and metadata are made persistent in transactional manner across both data sources.
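The two-step shape of such an aggregation can be sketched without either store. In the Python below, a list of tuples stands in for the triple store and a dictionary for the content store; the identifiers, the second headline and the function names are all assumptions for illustration, and the real API issues SPARQL and XQuery rather than list comprehensions.

```python
# Stand-in for the triple store: (asset id, predicate, concept).
METADATA = [
    ("asset-1", "about", "England Squad"),
    ("asset-2", "mentions", "England Squad"),
    ("asset-3", "about", "Chelsea"),
]

# Stand-in for the content store: asset id -> document.
CONTENT = {
    "asset-1": {"type": "story", "headline": "Goal re-ignites technology row"},
    "asset-2": {"type": "video", "headline": "Hypothetical match highlights"},
    "asset-3": {"type": "story", "headline": "Hypothetical club story"},
}

def asset_ids_for(concept):
    """Step 1 (the SPARQL role): which assets are tagged with this concept?"""
    return [asset for asset, _, tagged in METADATA if tagged == concept]

def aggregate(concept, asset_type):
    """Step 2 (the XQuery role): fetch and filter the referenced documents."""
    documents = []
    for asset_id in asset_ids_for(concept):
        document = CONTENT.get(asset_id)
        if document is not None and document["type"] == asset_type:
            documents.append(document)
    return documents

england_stories = aggregate("England Squad", "story")
```

Because the metadata side holds only identifiers, either store can change its internal model without the other noticing.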
The content API "TRiPOD" (Figure 12) makes use of a multi-data centre memcached cluster to store content aggregations and protect the triplestore and content-store from query storms. The API cache is split into a live cache with a typically low cache profile circa one-minute TTL and a second, longer stale cache with an expiry TTL of 72 hours.
Memcache is also used to control SPARQL/XQuery invocation using a memcache-based locking strategy.
If the live cache has expired a lock is created and a single query invocation thread per data-center is invoked. Subsequent requests are served from stale until the query responds refreshing both the stale and live cache. This caching and locking strategy enables the DSP platform to scale to many millions of page requests and associated backend queries a day.
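The live/stale/lock pattern is easier to follow in code. The Python below is an in-process sketch using dictionaries in place of memcached; the class and its method names are mine, and the TTL defaults follow the figures given above. In a real memcached deployment the lock could be taken with the atomic `add` command, which is an assumption about one way to do it rather than a description of our implementation.

```python
import time

class TwoTierCache:
    """Live cache plus longer-lived stale cache, with one refresher per key."""

    def __init__(self, query, live_ttl=60, stale_ttl=72 * 3600, clock=time.monotonic):
        self.query = query          # the expensive backend call
        self.live_ttl = live_ttl
        self.stale_ttl = stale_ttl
        self.clock = clock          # injectable so expiry can be tested
        self.live = {}              # key -> (expires_at, value)
        self.stale = {}             # key -> (expires_at, value)
        self.locks = set()          # keys currently being refreshed

    def get(self, key):
        now = self.clock()
        entry = self.live.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        stale = self.stale.get(key)
        if key in self.locks and stale is not None and stale[0] > now:
            return stale[1]         # someone else is refreshing: serve stale
        return self.refresh(key)

    def refresh(self, key):
        """Take the lock, run the query once, repopulate both caches."""
        self.locks.add(key)
        try:
            value = self.query(key)
            now = self.clock()
            self.live[key] = (now + self.live_ttl, value)
            self.stale[key] = (now + self.stale_ttl, value)
            return value
        finally:
            self.locks.discard(key)
```

However many requests arrive while the live entry is expired, the backend sees a single query per key, which is what protects it from a query storm.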
Fig 12: DSP architecture combining SPARQL/XQuery, RDF store, and XML Store
The Future: Fully Dynamic Publishing
Although the BBC Sport architecture enables static asset content aggregation and re-purposing based on dynamic triple-store RDF metadata it currently does not support dynamic editorial authored asset rendering.
Assets such as stories are currently statically published rendering them fixed and immutable.
The refreshed BBC Sports site will eventually require content objects to be cut-up, arranged and rendered with respect to state changes and persona.
The ability to render all content object fragments by state and indeed metadata concepts will enable the BBC Sport web site to facilitate personalised, event driven pages with greater flexibility than that currently achieved for the BBC sport web site. A re-usable content API which contextualises content objects for device and platform will enable the BBC to create new outputs and open the BBC archive to the public.
The DSP architecture (Figure 6) will now take a final evolution - deprecating the static, fixed asset publication in preference for dynamic content object renditions.
Content objects will be dynamically rendered on a request-by-request basis rather than 'fixed-in-time' static publication.
Textual content objects such as stories and editorially authored indexes such as the football home page will be made persistent within the schema independent content store.
The content store supports fine-grained XQuery, enabling search, versioning, and access control.
All editorially authored content objects such as stories and manually managed indices will also be stored within the content store.
The content store is horizontally scalable and allows content to be handled in discrete chunks, supporting the cutting up and repurposing of fine-grained content. Each content object within the content store will be modelled as a discrete document with no interrelationships.
Discrete content objects are to be modelled and referenced via the asset ontology RDF within the triple-store.
Triple-store SPARQL is used to locate, query and search for documents by concept providing all the aggregation and inference functionality required.
The content store is used for fast, scalable, queryable and searchable access to the raw content object data, while the triple-store continues to provide access to asset references and associated domain models.
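The division of labour between the two stores can be illustrated with a small in-memory sketch. Everything in it is an assumption made for illustration: the triples, the `about` and `competesIn` predicates, the identifiers and the sample documents are invented, and plain Python set logic stands in for SPARQL while a dictionary lookup stands in for XQuery against the content store.

```python
# Hypothetical metadata: (subject, predicate, object) triples.
# These names are illustrative and are not the BBC ontologies.
TRIPLES = [
    ("asset:story-1", "about", "team:a"),
    ("asset:story-2", "about", "team:b"),
    ("asset:story-3", "about", "team:a"),
    ("team:a", "competesIn", "comp:league"),
    ("team:b", "competesIn", "comp:league"),
]

# Hypothetical content store: discrete documents with no interrelationships.
CONTENT_STORE = {
    "asset:story-1": "<story>Team A win</story>",
    "asset:story-2": "<story>Team B draw</story>",
    "asset:story-3": "<story>Team A sign striker</story>",
}


def assets_about(concept, triples):
    """Locate asset ids by concept (the triple-store's job).

    An asset matches if it is about the concept itself, or about
    anything that competes in it; a crude stand-in for inference.
    """
    members = {s for s, p, o in triples if p == "competesIn" and o == concept}
    targets = members | {concept}
    return sorted(s for s, p, o in triples if p == "about" and o in targets)


def aggregate(concept, triples, content_store):
    """Fetch the raw documents for the located assets (the content store's job)."""
    return [content_store[a] for a in assets_about(concept, triples)
            if a in content_store]
```

Asking for `comp:league` returns all three stories even though none is tagged with the competition directly, which is the kind of aggregation the metadata layer provides without the documents needing to know about each other.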
The Graffiti annotation tool UI currently only makes it possible for a journalist to annotate static content objects post-publication; it does not integrate with the CPS UI.
Using the Graffiti API within the CPS UI will soon unify and rationalise the journalist's toolset. Merging the Graffiti UI into the CPS UI will provide a single UI for the journalist, supporting the creation and annotation of documents within a single view.
Real-time concept extraction and suggestion will occur as the journalist authors and then publishes content.
The DSP platform caching approach is fundamental to enabling a scalable and performant platform. The API memcache strategy is augmented with HTTP caching between the PHP render layer and the API. The PHP layer also makes use of memcache for page module caching; all page fragments are cached at an ESI page assembly layer with corresponding HTTP caching. The site as a whole is also for further scalability and resilience during very large traffic spikes.
Conclusion
A technical architecture that combines a document/content store with a triple-store proves an excellent data and metadata persistence layer for the BBC Sport site and indeed future builds including BBC News mobile.
- A triple-store provides a concise, accurate and clean implementation methodology for describing domain knowledge models.
- An RDF graph approach provides ultimate modelling expressivity, with the added advantage of deductive reasoning.
- SPARQL simplifies domain queries, with the associated underlying RDF schema being more flexible than a corresponding SQL/RDBMS approach.
- A document/content store provides schema flexibility; schema independent storage; versioning, and search and query facilities across atomic content objects.
- Combining a model expressed as RDF referencing content objects in a scalable document/content-store provides a persistence layer that uses the best of both technical approaches.
This combination removes the shackles associated with traditional RDBMS approaches.
Using each data store for what it is best at creates a framework that scales and is ultimately flexible.
Replacing a static publishing mechanism with a dynamic request-by-request solution that uses a scalable metadata/data layer will remove the barriers to creativity for BBC journalists, designers and product managers, allowing them to make the very best use of the BBC's content.
Simplifying the authoring approach via metadata annotation opens this content up and increases the reach and value of the BBC's online content.
Finally, combining the triple-store approach with dynamic atomic documents as an architectural foundation simplifies the publication of pan-BBC content as "open linked data" between BBC systems and across the wider linked open data cloud.
Jem Rayfield is a lead architect in BBC Future Media, specifically focusing on News, Sport & Knowledge products.
Comment number 1.
At 17th Apr 2012, Russ wrote:I'm sure this is all wonderful stuff, but here's the thing: in yesterday's match between Arsenal and Wigan, the three places on the Live Football page showing the score all showed different scores throughout the game.
Very ontological I thought.
Russ
Comment number 2.
At 17th Apr 2012, Greg Tyler wrote:Absolutely fascinating post. Excellent job.
Comment number 3.
At 17th Apr 2012, glossymighty wrote:All technical areas of work create their own jargon which is fine for those that work in those areas, but makes them opaque for those outside them. I always told my staff to avoid such jargon in communications to any outside audience which, in their case, was the whole of the NHS. Our technical area was information technology. The above article completely fails the "opaqueness" test for me and that in an area that I thought I knew something about!
Can somebody please send Jem on a plain English course?
Ian
Comment number 4.
At 18th Apr 2012, Dan wrote:Are there any plans to publish the Olympics ontology as the BBC has published its Sport ontology? It would be useful to have access to the RDF of sports, disciplines, events, etc. as Linked Open Data.
Comment number 5.
At 18th Apr 2012, JamesRogers wrote:Wow, this is not a blog post, but a lecture. Very interesting.
I have a first question. Can you explain to me what horizontal navigation has to do with the dynamic publishing? Why can that not be done using vertical navigation?
Comment number 6.
At 18th Apr 2012, Ian McDonald wrote:@glossymighty (#3):
Thank you for your comment.
I subedit blog posts before they go out and make sure they're apt for the blog, so I'll answer your point.
I do work with bloggers to make their posts as accessible as possible. For example, terms are defined and linked to explanations. But some blog posts will always be more specialist than others.
For example, Patrick Sinclair's blog post about the Radio 1 home page was aimed at a general audience that's interested in how the BBC uses technology. Jeremy Tarling's blog post about the technical architecture of the new BBC Weather website was aimed at fellow web professionals.
Jem's blog post is more specialist; it's certainly never going to be for a mass audience.
I try to make sure that the start of the blog post is a pretty good guide to how technical it is.
Nick and I do think about how to make posts as readable as possible, and it is good to get feedback on this.
Looking at other reactions, Jem's fellow specialists seem to find his post extremely useful and interesting.
Comment number 7.
At 18th Apr 2012, Nafets wrote:@Russ - an interesting observation. I guess this is not ontology related at all, but is rather a cache issue. As I understand all the statistics are stored as XML in MarkLogic Content Store and a page is a mix of RDF-ized data and XML data - so far so good. I'm puzzled with this massive caching - maybe those 3 places on the Live Football page were getting the match result data with 3 different queries, so when the match result changed not all 3 places got updated at the same time. I agree it is confusing, but it's not that bad, because the TTL is only one minute (still should be fixed though).
@Dan - I guess they're using the Sports ontology for the Olympics as it is generic enough.
@Jem - I'm looking forward to your comment on my comment ;)
Comment number 8.
At 18th Apr 2012, JamesRogers wrote:@Nafets and @Russ the caching is a bigger problem on the new sports pages. Since the new BBC Sports system, I often get the football, cricket, or rugby league subhomepage from a week ago (on Firefox with Windows 7) and have to manually refresh.
Comment number 9.
At 19th Apr 2012, Bart Hoffmann wrote:Very interesting story.
Comment number 10.
At 19th Apr 2012, jemray wrote:@Nafets
Firstly, sorry for the delay in response. I have been at Scala Days here in London, which was awesome.
Anyhow...
You are correct: cache TTLs will affect the statistics on the sport pages.
"maybe those 3 places on the Live Football page were getting the match result data with 3 different queries, so when the match result changed not all 3 places got updated at the same time"
Different request times and TTL alignments can give rise to out-of-sync stats. We currently rely on a fairly low set of TTLs to try and minimize this issue. We also have different cache profiles for different stats. However, it is difficult to drop TTLs much lower than 1 minute, as this has a direct correlation to the number of requests which traverse the entire stack. Given the large number of requests we receive, we need to be very careful how we protect our back-end, for example.
The problem becomes more tricky when you consider that we have 2 data centers which act in a stateless fashion with 2 isolated memcached cache clusters. For example, if your browser's DNS resolution cache TTL expires, you may be load-balanced to another data center, where the cache TTLs may be slightly different to those on the last data center. Thus, again, you may see inconsistencies.
So... given the current pull model, stateless isolated data center model and min TTL restriction, you may from time to time see inconsistencies that will eventually resolve. Not ideal; however, we are working on tuning the cache profiles, so hopefully this will improve.
Cache eviction is also non-trivial, as we have many layers of caching, from memcached to HTTP caches and indeed ISP and browser caches. The programmatic model for forcing cache evictions becomes complex. At this stage we believe this isn't the best model for consistency.
So...we are currently investigating a replacement stats delivery method. Moving from the current browser pull method towards websockets and push delivery... Perhaps the subject of another blog post when/if we get to a position to move forward with this option.
Also on this
"I guess they're using the Sports ontology for the Olympics as it is generic enough."
The current sports ontology is generic enough with a few v.minor modifications which we will be publishing at some point soon.
@JamesRogers
"Can you explain to me what horizontal navigation has to do with the dynamic publishing. Why can that not be done using vertical navigation?"
Horizontal navigation has nothing to do with dynamic publishing; this was purely a product decision which enables more real estate and simpler inline navigation: navigation inline following the domain ontology, rather than complex taxonomical left-hand navigation out of context.
@Dan
"Are there any plans to publish the Olympics ontology as the BBC has published its Sport ontology?"
The sport ontology will be extended with some minor Olympics changes and re-published.
We also have plans to publish open RDF. However, this may not be available in time for the Olympics. We plan on opening our underlying service API to the public. This API is able to produce content-negotiated RDF. However, we have a number of scaling and throttling issues with opening these APIs at the moment which we need to work through. In addition, an open SPARQL endpoint is high on my wish list. All things we have on our backlog... but not yet with a firm delivery date.
@glossymighty
Apologies for the language, I am partial to the odd acronym. I tried to include explanations within the blog to clear things up. Hopefully as Ian mentioned the text is usable by others. I will attempt to make things clearer in following posts...
Cheers and thanks for your comments!
Jem
Comment number 11.
At 21st Apr 2012, lucas42 wrote:You say "The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page." but I don't seem to be able to access any RDF. I've tried using an accept header and also adding an extension to the end of the URL (which /music and /programmes accept).
Perhaps I've misunderstood - you've said that the underlying service API isn't public yet, but I assume these issues don't affect the RDF view rendered by the PHP layer.
Also, how do things like "Featured Athletes" and "Featured Countries" fit into this? I assume they're editorially chosen - does this mean they rely on the static publication chain? Or if they're in the dynamic publication side, how do journalists (or whoever chooses) input them; is it through Graffiti?
Finally, you've explained the difference between /sport and /sport/0 (though I don't understand why this difference should be visible in the url). I've noticed lots of pages still use news.bbc.co.uk/sport1/. Are these just legacy pages which don't use this new system yet, or is there a third publication system which you haven't mentioned here?
Comment number 12.
At 21st Apr 2012, Josh wrote:Jem, this was a fascinating read. Thanks!
Comment number 13.
At 23rd Apr 2012, Barry Norton wrote:Since this, and previous of Jem's posts, are enthusiastically read across the World by people in the media sector, @glossymighty, perhaps for the sake of clarity you should explain what 'NHS' means.
Comment number 14.
At 25th Apr 2012, Dominic wrote:Any idea why I'm getting March RSS feeds from the BBC Sport Everton page, today, in April?
The rss feed seems to be broken for the last couple of months.
feed://newsrss.bbc.co.uk/rss/sportonline_uk_edition/football/teams/e/everton/rss.xml
Comment number 15.
At 25th Apr 2012, DBOne wrote:RSS link on the Everton page is /sport/football/teams/everton/rss.xml
Comment number 16.
At 27th Apr 2012, sardire wrote:Jem terrific post and look forward to your updated preso 'Dynamic Semantic Publishing Empowering the BBC Sports Site and the 2012 Olympics' at SemtechBiz SF June 3 - 7
Comment number 17.
At 4th May 2012, Andy_Pipes_BBC wrote:@lucas42 The content-negotiated RDF views you mention are part of the road map, but not delivered yet I'm afraid.
Re: Featured Countries/Athletes: these are chosen editorially, but modeled in the Olympics ontology (oly:oneToWatch) against a sport:SportsDiscipline (such as "Archery") or a sport:MedalCompetition (such as "Men's Synchronised 10m Platform") or a sport:CompetitiveSportingOrganisation (such as "Great Britain & N. Ireland"). There are specific RDF feeds for Ones to Watch which we hope to publish soon.
re your question about the other url patterns: yes, legacy urls.
Comment number 18.
At 27th Jun 2012, Lose Weight wrote:I am sure this will increase the user experience even more. It never hurts to make something good better!
Comment number 19.
At 28th Jun 2012, Interrossiter wrote:Sorry to say that I've left it months to see if it ever got better, but I'm just a bog standard user who's now not a visitor to the Sport website as it's still awful, colours & content in no way match the previously easily negotiable & user friendly site.
Money apart, why oh why do the BBC's home pages all have to all look & the same - yes, uploading becomes simple but viewing has become truly awful?