Towards an Editable, Versionized LOD Service for Library Data

The North Rhine-Westphalian Library Service Center (hbz) launched its LOD service lobid.org in August 2010 and has since then continuously been improving the underlying conversion processes, data models and software. The present paper first explains the background and motivation for developing lobid.org. It then describes the underlying software framework Phresnel, which is written in PHP and provides presentation and editing capabilities for RDF data based on the Fresnel Display Vocabulary for RDF. The paper gives an overview of the current state of the Phresnel development and discusses the technical challenges encountered. Finally, possible prospects for further developing Phresnel are outlined.


1. Introduction
In the broader library world, Linked Open Data (LOD) 1 has gained a lot of attention over the last two years, with projects moving increasingly from theory to practice. The library domain is increasingly addressing the technical and legal issues implied by this paradigm shift, with the announcement of 'A Bibliographic Framework for the Digital Age' by the Library of Congress (Marcum & Library of Congress, 2011) and the Conference of European National Librarians' affirmation of open licensing for their data (Conference of European National Librarians, 2011) being only two examples. Since 2009 the North Rhine-Westphalian Library Service Center (hbz) has been exploring Linked Open Data and Semantic Web technologies, addressing both the legal and the technological aspects of this ongoing paradigm change in information representation and provision. The hbz launched its LOD service lobid.org (standing for "Linking Open Bibliographic Data") in August 2010 and has continuously been improving it since then.
The overall goal is to develop the underlying software framework so that it can run read/write services, including a web presentation of the underlying RDF data and online forms to create, update and delete the underlying information represented in RDF (Resource Description Framework). To keep track of changes and to be able to revoke them, the system should also fully versionize the underlying data structures. This paper explains the motivation (section 2) and describes the LOD service lobid.org (section 3), as these developments were initiated to improve the service. Section 4 introduces the Fresnel Display Vocabulary for RDF, which serves as a generic way to configure the presentation of RDF data. Section 5 explains Phresnel, a free software framework for presenting and editing RDF data based on Fresnel and PHP. Finally, section 6 lists prospects for further developing Phresnel.

2. Motivation and expected benefits
The hbz has been in the business of cooperative cataloguing for some time now, running a union catalogue since 1973. From this perspective, Linked Open Data provides a very interesting approach for distributed cooperative cataloguing in a web environment.

Web integration
Adherence to international and cross-domain web standards for Linked Data means web integration of library data, from which the following benefits are expected in the long term:
• Web-integrated data can easily be harvested by search engines and other discovery services.
• Multiple usability. RDF data stored in one data sink can easily be used as such by different services within the hbz and beyond.
• Interoperability and re-usability. Web standards facilitate reuse by reducing the need for conversion processes and post-processing.
• Flexibility. RDF and triple stores are very flexible regarding extensions and changes in the data model used.

Synergy effects
By following the best practices of the Linked Open Data community and working towards a standardization of the data produced by the different services within the hbz, returns on investment in the form of intra- and inter-organizational synergy effects are expected. Within the organization, we already see some effects in hbz projects reusing each other's data. With LOD, this is possible in a straightforward way, whereas having to deal with proprietary interfaces and different formats often makes it quite labor-intensive for services to communicate effectively. Thus, standardization of services frees up resources which can then be used for additional services or for improving existing ones.

Fewer vendor dependencies
The provision of many library services depends on technology, and there are only a few vendors in the library business. Organizations often depend on products from one or two vendors, and switching to a different vendor or product incurs significant costs. In other words, the lock-in effect is very strong in the library domain.

3. lobid.org
In August 2010 the hbz launched its experimental Linked (Open) Data service lobid.org 2 , which comprises two services: a "catalogue" of bibliographic resources and holding information (lobid-resources 3 ) and an index of libraries and related organizations (lobid-organizations 4 ). lobid.org fully employs Linked Data principles as well as, whenever possible, Open Data principles.
Since 2010 the two lobid.org services and their underlying data have been continuously improved:
• Information is being extended by adding more fields from legacy data to the mapping and by revising vocabulary and property choices.
• Context is being added by linking resources to other Linked Data sets.
• Interaction options for end users are being improved, e.g. by a search engine interface and by aligning the user interface of both sub-services (see 5).
Figure 1 gives an overview of the currently employed technology stack, data sources and conversion processes. lobid.org employs Garlik's 4store 5 as its triple store and elasticsearch 6 for full-text indexing and searching. The web front end runs on an Apache server and is generated by the Phresnel framework described in more detail below.
As one can see, lobid.org is currently almost entirely based on legacy data dumps from existing systems that are converted to RDF using custom tools. The resulting RDF data are enriched with links to other datasets in the LOD cloud. Some external LOD datasets are also indexed into the triple store: currently these are the ontologies used within lobid.org as well as the German national authority file Gemeinsame Normdatei (GND) 7 provided by the German National Library. So far, there is no way to manually add or edit the RDF data, e.g. to add new information (commonly called 'cataloguing') or to correct mistakes.

lobid-organizations
When the hbz started to publish Linked Open Data, it became clear that the bibliographic records from the hbz union catalogue would just be the start. If you want to build useful services on top of Linked Open Data, you also need URIs for and descriptions of items, holding institutions and services. For example, a geo-based query which returns all items of a specific manifestation within a 5 km radius requires URIs for and RDF descriptions of at least three entities: there is a manifestation M that is exemplified by an item I that is held by organisation O. In the RDF serialization Turtle it reads as illustrated in Figure 2.
The corresponding graph is shown in Figure 3.
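The three-entity pattern and the radius query can be sketched in Python, with plain tuples standing in for RDF triples. This is only an illustration: the property names (`ex:exemplar`, `ex:heldBy`), the identifiers and the coordinates are hypothetical placeholders and not the vocabulary or data used by lobid.org.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical triples: manifestation M is exemplified by item I, held by organisation O.
TRIPLES = [
    ("ex:M1", "ex:exemplar", "ex:I1"),
    ("ex:M1", "ex:exemplar", "ex:I2"),
    ("ex:I1", "ex:heldBy", "ex:O1"),
    ("ex:I2", "ex:heldBy", "ex:O2"),
]
# Hypothetical coordinates (latitude, longitude) of the holding organisations.
LOCATIONS = {"ex:O1": (50.94, 6.96), "ex:O2": (52.52, 13.40)}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def items_nearby(manifestation, centre, radius_km):
    """Return items of a manifestation held by an organisation within radius_km."""
    items = [o for s, p, o in TRIPLES if s == manifestation and p == "ex:exemplar"]
    holders = {s: o for s, p, o in TRIPLES if p == "ex:heldBy"}
    return [i for i in items if distance_km(LOCATIONS[holders[i]], centre) <= radius_km]
```

The query only works because each of the three entities has its own identifier: without a URI for the item or the organisation, the chain from manifestation to location could not be followed.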
Two years ago, people and organizations were reluctant to act when asked to provide Linked Open Data, so it became clear that we had to do things ourselves. Thus, lobid.org was launched with lobid-organizations in July 2010 (Ostrowski, 2010). The underlying data come from the German ISIL registry 8 and the MARC organization code database 9 maintained by the Library of Congress. By now, lobid.org has minted URIs for more than 40,000 institutions and provides basic RDF descriptions of them. 10 Currently the data are neither available as an open dump nor openly licensed, since the hbz did not produce the data itself and therefore cannot decide on this.

Enrichments
In addition to the data obtained from the data sources mentioned above, new links to other datasets in the LOD cloud are created. By now, links to DBpedia and Wikipedia (Christoph, 2012d) and to GeoNames (Christoph, 2012a) have been added. Furthermore, organization descriptions are enriched with a QR code which contains contact information (ibid.). Also, based on the geo coordinates available for most of the libraries, we show their location on an OpenStreetMap 11 map embedded in the web page.
An example web page for the German National Library (generated from the underlying RDF as described in 5.1) is illustrated in Figure 4.

lobid-resources
lobid-resources is basically the LOD interface for Open Data from the hbz union catalogue. It offers URIs for and descriptions of bibliographic resources like monographs and multi-volume works on a FRBR manifestation level as well as URIs for and descriptions of corresponding items held by hbz member libraries. Also included are journals and serials that cannot be subsumed under the FRBR WEMI (work-expression-manifestation-item) model (Pohl, 2011). Information at the FRBR expression level, and especially at the work level, is planned to be integrated in the future.
Since the first main open data publication in March 2010 (North Rhine-Westphalian Library Service Center, 2010), more and more data from this catalogue have gradually been published as open data in agreement with cooperating libraries. As of August 2012, the dataset comprises approximately 16 million records published under a Creative Commons Zero license 13 , which represents 85% of the hbz union catalogue (Christoph, 2012e). Using custom conversion tools, the data are generated based on an XML dump from the hbz Aleph system. The resulting RDF data can be queried via a public SPARQL endpoint 14 , and a full data dump is also available for download 15 .

Enrichments
Because identifiers from the German-wide authority file for names, subject headings and corporate entities already existed in the legacy data, links to the Linked Data version of the Gemeinsame Normdatei (GND, first published in 2010) were easy to implement. Also, links to other datasets that include bibliographic data were added step by step. Using simple matching algorithms for ISBN and title string in combination with some post-processing based on simple heuristics, links to DBpedia (Christoph, 2012b), Open Library (Christoph, 2012c) and Project Gutenberg were added to a subset of resources. These links provide a kind of work-level bundling of resources, enabling for instance mutual enrichment of bundled resources with subject headings, links etc.
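A matching step of this kind might look roughly like the following Python sketch. It is not the hbz implementation, whose details are not given here; the record layout (dictionaries with `isbn` and `title` keys) and the rule "same ISBN and same normalized title" are assumptions made for illustration.

```python
import re


def normalize_isbn(raw):
    """Return the ISBN as 13 digits, converting ISBN-10 where needed; None if malformed."""
    chars = re.sub(r"[^0-9Xx]", "", raw).upper()
    if len(chars) == 13 and chars.isdigit():
        return chars
    if len(chars) == 10 and chars[:9].isdigit():
        core = "978" + chars[:9]
        # ISBN-13 check digit: weights alternate 1, 3 over the first twelve digits.
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def normalize_title(title):
    """Lower-case a title and drop punctuation and surplus whitespace."""
    return " ".join(re.findall(r"\w+", title.lower()))


def is_match(record_a, record_b):
    """Heuristic: same ISBN and same normalized title string."""
    isbn_a = normalize_isbn(record_a["isbn"])
    isbn_b = normalize_isbn(record_b["isbn"])
    same_isbn = isbn_a is not None and isbn_a == isbn_b
    return same_isbn and normalize_title(record_a["title"]) == normalize_title(record_b["title"])
```

Normalizing both ISBN forms to 13 digits lets records that cite the same edition in different notations be compared directly; the title comparison then guards against reused or mistyped ISBNs.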
In the future the hbz aims at enhancing the data even more by adding subject headings and classification as well as by providing more links to other datasets and to full texts online. A simple API will be developed to enable easy use by libraries that want to re-use these enrichments. An example resource description is depicted in Figure 5.

4. Presenting RDF data using the Fresnel Display Vocabulary for RDF

Rationale
In the beginning, the converted legacy data for bibliographic resources was exposed using Pubby, "a Linked Data Frontend for SPARQL Endpoints" 18 . While Pubby is very easy to set up, the resulting views, and among those especially the human-readable HTML, exposed too much of the underlying technology. The organizations data, on the other hand, was presented using a custom SPARQL query, PHP scripts and some HTML templates in order to, e.g., include a map in the HTML view. This provided more flexibility but was not easily adapted to data other than that about organizations, since each content model needed a manually created query and corresponding template. Besides that, there was the idea to enable libraries to easily create RDFa 19 descriptions of their organizations. This created the need for a simple, intuitive editor. Instead of exposing the underlying RDF model to content creators, the aim was a browser-based HTML form, providing a familiar environment for anybody acquainted with the Web.
With these requirements in mind, the search began for a schema language from which such a front-end could be derived. When dealing with RDF data, RDF Schema (RDFS) and the Web Ontology Language (OWL) are the first candidates that spring to mind. Since ontologies expressed in these languages are usually designed to be application-independent, experiments in this direction were rather fruitless, because the resulting views were too generic to fulfil the requirements. In particular, mixing classes and properties from several vocabularies in a concise and comprehensible way is nearly impossible. Fortunately, we came across the Fresnel Display Vocabulary for RDF. It is designed precisely to specify "what information contained in an RDF graph should be presented and how this information should be presented" 20 without interfering with the underlying ontologies. Like the ontology languages mentioned above, it is itself based on RDF, making it possible to stay within one data model throughout the implementation.

Lenses
Fresnel lenses address the first aspect mentioned above, namely which data should be displayed. A single lens can be related to instances in several ways, the simplest possibility being a reference to their class (i.e. their rdf:type values) as demonstrated in Figure 6. For the selected instances, an ordered list of properties is supplied, which is very easy and readable in Turtle notation. In order to include data about a related entity in the output, another lens may be referred to. In the example in Figure 6, the author's first and last name will be displayed in a document description, and not only the author's URI, as would be the case when listing dc:creator without referring to a :person sublens as is done on the right hand side. Applying the above lenses to an (imaginary) triple store should yield the triples in Figure 7, which should be displayed in that order. All in all, Fresnel lenses allow for a very concise and declarative way to express which data to select and in which order to display it.
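The selection logic described above can be approximated in a few lines of Python. The sketch below is not Phresnel code and does not parse real Fresnel; the dictionary layout, the `ex:` identifiers and the sample triples are hypothetical and only loosely mirror a lens that selects by class, lists properties in order and refers to a sublens for dc:creator.

```python
# Hypothetical lens definitions: a class the lens applies to and an ordered
# property list; a (property, lens name) pair denotes a sublens.
LENSES = {
    "document": {"class": "ex:Document",
                 "show": ["dc:title", ("dc:creator", "person")]},
    "person": {"class": "foaf:Person",
               "show": ["foaf:givenName", "foaf:familyName"]},
}

TRIPLES = [
    ("ex:doc1", "rdf:type", "ex:Document"),
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:doc1", "dc:title", "An Example"),
    ("ex:doc1", "dc:date", "2012"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:familyName", "Example"),
    ("ex:alice", "foaf:givenName", "Alice"),
]


def apply_lens(lens_name, subject, triples):
    """Select the triples a lens asks for, in the order the lens lists them."""
    lens = LENSES[lens_name]
    if (subject, "rdf:type", lens["class"]) not in triples:
        return []  # the lens does not apply to this resource
    selected = []
    for entry in lens["show"]:
        prop, sublens = entry if isinstance(entry, tuple) else (entry, None)
        for s, p, o in triples:
            if s == subject and p == prop:
                selected.append((s, p, o))
                if sublens:
                    # Follow the link and add the related entity's data.
                    selected.extend(apply_lens(sublens, o, triples))
    return selected
```

Calling `apply_lens("document", "ex:doc1", TRIPLES)` returns the title first, then the creator link followed by the creator's given and family name; the dc:date triple is left out because the lens does not list it.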

Formats
Fresnel formats deal with the second issue stated above: they express how the selected data should be displayed. Possibilities range from custom labels for properties that differ from the labels defined in an ontology, to styling hooks used to reference CSS classes. An example can be found in Figure 8.
Just as Fresnel lenses select and order data, the format vocabulary allows configuring the way that data is displayed in a declarative, application-independent way.
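As a rough illustration of what a format contributes, the following Python sketch attaches a custom label and a CSS class to a property when rendering it. The `FORMATS` dictionary and its keys are hypothetical stand-ins for Fresnel format definitions, and the table-row markup is only an assumed output shape.

```python
from html import escape

# Hypothetical format definitions: a custom label and a CSS class per property.
FORMATS = {
    "dc:title": {"label": "Title", "css_class": "title"},
    "dc:creator": {"label": "Author", "css_class": "creator"},
}


def render_property(prop, value):
    """Render one property/value pair as an HTML table row."""
    fmt = FORMATS.get(prop, {})
    label = fmt.get("label", prop)  # fall back to the property name itself
    css = fmt.get("css_class", "property")
    return '<tr class="{}"><th>{}</th><td>{}</td></tr>'.format(
        escape(css), escape(label), escape(value))
```

Because labels and style hooks live in a separate declarative structure, the same selected data can be presented differently without touching the selection step or the underlying ontology.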

Alternatives
Several implementations of Fresnel already existed 21 when the development of Phresnel began. Most of them are written in Java and none is implemented in PHP, upon which the pre-Phresnel version of lobid.org was based. Standalone applications supporting Fresnel, such as IsaViz 22 , do not fulfil our requirement of providing a classical browser-based user interface. JFresnel 23 , as a low-level Fresnel API, is an interesting library for implementing Java-based Fresnel-aware applications, but it does not deliver any application logic. Reusing our existing PHP web-application code (such as request dispatching and URL routing) would not have been possible, so we decided against using this library. Longwell 24 is geared towards faceted browsing, which is indeed an important aspect of a system such as lobid.org. LENA 25 is yet another Linked Data viewer, but neither of these solutions provides the means to alter data. Also, in both cases the demos are offline and development appears to have stalled. There are further Linked Data front-ends for SPARQL endpoints, such as Pubby 26 and Elda 27 , but these do not use Fresnel and, once again, do not provide editing capabilities. Thus, a new, PHP-based, editing-aware framework dubbed 'Phresnel' 28 was implemented as a proof of concept. Currently only a small subset of lens and format features is implemented in Phresnel, limited to those absolutely necessary to get the prototype up and running.

Displaying
In order to display data according to a Fresnel lens, the web application detects the lens to be used from the URL to which a GET request was issued, e.g. "document". It then uses the Phresnel framework to generate a generic (X)HTML view (with embedded RDFa) of the requested data as shown in Figure 9. At this point it is assumed that an HTTP 303 redirect following the linked-data design pattern 29 has already occurred in a previous step.
Internally, Phresnel uses the lens definitions to generate a series of SPARQL CONSTRUCT queries such as those depicted in Figure 10 and then uses the lens and format definitions to order and style the resulting RDF.
Currently, only a hard-coded box model (Bizer, Lee, & Pietriga, 2005) based on nested tables is available. The ordering and transformation to (X)HTML can obviously be skipped when a pure RDF representation is requested via content negotiation.
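How a lens's property list might be turned into a SPARQL CONSTRUCT query can be sketched as follows. This Python function is an illustration under stated assumptions, not the query generator used by Phresnel (which is written in PHP): the function name is hypothetical, the use of OPTIONAL is a design choice made here, and the URIs are assumed to contain no characters that would need escaping.

```python
def construct_query(subject_uri, properties):
    """Build a SPARQL CONSTRUCT query for the listed properties of one resource."""
    subject = "<{}>".format(subject_uri)
    template = []
    patterns = []
    for i, prop in enumerate(properties):
        triple = "{} <{}> ?o{} .".format(subject, prop, i)
        template.append("  " + triple)
        # OPTIONAL keeps the other properties when one of them is missing.
        patterns.append("  OPTIONAL { " + triple + " }")
    return ("CONSTRUCT {\n" + "\n".join(template) + "\n}\n"
            "WHERE {\n" + "\n".join(patterns) + "\n}")
```

A query built this way returns exactly the triples named by the lens, so the result can be passed on unchanged when a pure RDF representation is requested, or ordered and styled when (X)HTML is requested.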

Editing
When assembling the editing view of a resource, the steps are very similar to those when requesting a simple display representation, as can be seen in Figure 11. But there is one major difference. When the RDF resulting from the SPARQL queries is transformed to an (X)HTML form, all literals are simply converted to text input elements. Unfortunately, treating links to other entities in the same way would result in an awkward interface, since the URIs of those entities would have to be looked up and typed in manually. This is unacceptable from a usability point of view.
The current implementation addresses this in a way that is not yet sufficient but at least visually acceptable. Link targets, e.g. a list of possible authors in the above example, are displayed as drop-down lists populated from the triple store. Unfortunately this has two severe limitations: the list can easily get excessively long, and the entries in the list are limited to the service's own data source, making it hard to link to other sources on the web. For the future, it is planned to get around these limitations by means of auto-complete text fields that are either driven by custom JavaScript snippets for external data sources or connected to the service's own search index discussed in the next section. At this point it should be noted again that while data can be edited using this web front-end, currently no workflow for storing the resulting data is in place.
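The distinction between literals and links in the editing view can be illustrated with a short Python sketch. It is not taken from Phresnel; the function name, the markup and the `candidates` mapping (URI to label, standing in for link targets fetched from the triple store) are assumptions.

```python
from html import escape


def form_field(prop, value, candidates=None):
    """Return an HTML form control for one property value.

    Literals become text inputs; links become drop-down lists when a
    mapping of candidate URIs to labels is supplied.
    """
    name = escape(prop)
    if candidates is None:
        return '<input type="text" name="{}" value="{}"/>'.format(name, escape(value))
    options = []
    for uri, label in candidates.items():
        selected = ' selected="selected"' if uri == value else ""
        options.append('<option value="{}"{}>{}</option>'.format(
            escape(uri), selected, escape(label)))
    return '<select name="{}">{}</select>'.format(name, "".join(options))
```

The sketch also makes the first limitation visible: every candidate becomes one option element, so the control grows linearly with the number of possible link targets.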

Search
While the Linked Data paradigm is great for navigation, search is vital to actually discover resources. Although SPARQL supports regular expressions 30 that can be used for search, this currently does not scale to the data volume of lobid.org. This is why the search for organizations 31 as well as the one for resources 32 on lobid.org is currently backed by an elasticsearch 33 index, accessed via a custom web application that provides a CQL interface. This concrete setup has historical reasons. The resource index existed long before lobid.org and is used by several hbz services, and it was easiest to simply integrate the organization index into the same infrastructure.
Due to this construction, search results, which are received as Atom feeds, have a data structure that does not match the Fresnel lens definitions driving lobid.org. Hence, currently only the identifiers are extracted from the search result. These are used to construct the URI of the discovered resource, and Phresnel is then used to retrieve and format the data as described above. This works fairly well and is much more efficient than using native SPARQL queries for full-text search, but it suffers from the fact that after the resource has been identified, the data about it has to be retrieved again, this time from the triple store. This is only one of the problems that will have to be tackled next.
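The identifier extraction step could look roughly like the following Python sketch. The structure of the actual feed is not described here, so it is assumed that each entry carries the record identifier in its atom:id element; the `BASE_URI` prefix and the sample identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Hypothetical URI prefix; the real URI pattern is not given in the text.
BASE_URI = "http://example.org/resource/"


def resource_uris(atom_xml):
    """Extract the entry identifiers from an Atom feed and turn them into URIs."""
    feed = ET.fromstring(atom_xml)
    ids = [entry.findtext(ATOM + "id") for entry in feed.iter(ATOM + "entry")]
    return [BASE_URI + i.strip() for i in ids if i]
```

Everything else in the feed is discarded, which is why each hit requires a second round trip to the triple store before it can be displayed.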

6. Prospects
The current proof-of-concept implementation for a read/write system for LOD-based library data uncovers interesting prospects for future data management. It is clear though that the efforts are still very much at the beginning. Further Phresnel developments will explore options in the following areas.

Purely JavaScript based editing
Since the most important part of the editor, the facility to look up resources and create links, will need to be reworked for the reasons mentioned above, a switch to an editor implementation based solely on RDFa and JavaScript, in a way similar to create.js 34 , is being considered, abandoning server-generated forms altogether. This would not only create a more fluid experience for the user but would also reduce load on the server and provide a better separation of concerns between the front-end and the back-end.

Data production and maintenance
The results of editing data in the web front-end will have to become persistent. There are several non-trivial decisions that have to be made in this respect: in how many places should the data be stored (triple store, flat files, search index), how should the data be organized (named graphs (Dodds, 2009), …), which provenance information should be recorded, which authorization system should be used etc. Since an application like the one described in this paper naturally lives in a networked, decentralized environment, thought will also have to be put into a solution to inform other connected services about creations, updates and deletions of data. 35 One idea is to use a real-time, message-based protocol such as IRC or XMPP. 36 Another important feature regarding data management is obviously identification and authentication of the agents acting upon the data. Instead of implementing such a system from scratch, it should be based upon a standard and ideally also on Linked Data principles. Because of this, the most likely approach to be used is WebID (Sporny, Inkster, Story, Harbulot, & Bachmann-Gmür, 2011), which uses FOAF descriptions of agents in conjunction with SSL certificates. This results in a secure distributed identification and authentication mechanism that is reasonably easy to use for humans as well as for machines.

Versioning
Among the most important provenance information is a seamless history of changes made to the data, along with the identification of the agent (be it a system or person) that is responsible for these changes. While it is possible to express changes to RDF data as change-sets using an RDF 37 vocabulary, it is very likely that this is not the most efficient way to store them, since the triple count in a store would explode (at least if it is the same store that holds the actual data). Alternatives to be explored are using a versioning system that operates on flat files, such as git 38 , or using the versioning features of elasticsearch, which can be considered not only a search engine but also a document store. In order to expose the different versions of the data in a standardized way, implementing a Memento 39 interface for the selected versioning system is being considered.
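Independent of where versions are eventually stored, the idea of a change-set can be sketched in Python as the difference between two sets of triples. This is a generic illustration and not tied to any particular RDF change-set vocabulary or to the storage options discussed above; the function names and the sample data are hypothetical.

```python
def change_set(old_triples, new_triples):
    """Describe the difference between two versions of a resource description."""
    old, new = set(old_triples), set(new_triples)
    return {"removals": sorted(old - new), "additions": sorted(new - old)}


def apply_change_set(triples, change):
    """Apply a change set to a set of triples and return the new version."""
    return (set(triples) - set(change["removals"])) | set(change["additions"])


def revert(triples, change):
    """Undo a previously applied change set by swapping additions and removals."""
    inverse = {"removals": change["additions"], "additions": change["removals"]}
    return apply_change_set(triples, inverse)
```

Each edit stores at least one removal and one addition on top of the data itself, which illustrates why keeping the full history as triples in the same store would inflate the triple count.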

JSON-LD in ES / Fresnel-based search engine indexing
The way in which search is currently tied into the application is not very generic, and it depends on external organizational and technical processes. Since look-up is a very important part of the system, a solution that ties in more naturally is preferred. Being schema-less, elasticsearch should play nicely with RDF data. Since it indexes JSON data, an evaluation of several JSON serializations of RDF has begun. The most promising approach seems to be JSON-LD 40 , on the one hand because it is the one most likely to become a standard, and on the other hand because it structures data in a way that matches the key-value approach that elasticsearch is based upon. The Fresnel lenses driving the front-end could be reused to generate the JSON structures for the index.
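How a lens might drive the generation of an index document can be sketched in Python as follows. The lens layout (a list of JSON key and property IRI pairs), the function name and the sample data are assumptions; as a simplification, all values are emitted as plain strings, so links to other resources are not marked as IRIs in the context.

```python
# Hypothetical lens: an ordered list of properties, given as (JSON key, property IRI).
DOCUMENT_LENS = [
    ("title", "http://purl.org/dc/terms/title"),
    ("creator", "http://purl.org/dc/terms/creator"),
]


def to_jsonld(subject, triples, lens):
    """Build a JSON-LD document for one resource from the lens's properties."""
    doc = {"@context": {key: iri for key, iri in lens}, "@id": subject}
    for key, iri in lens:
        values = [o for s, p, o in triples if s == subject and p == iri]
        if values:
            # Single values stay scalar; repeated properties become lists.
            doc[key] = values[0] if len(values) == 1 else values
    return doc
```

The resulting dictionary can be serialized with `json.dumps` and indexed as an ordinary JSON document, while the `@context` keeps the mapping from short keys back to property IRIs.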