Automating subject indexing at ZBW: making research results stick in practice

Anna Kasprzik

doi:10.53377/lq.13579

Authors

Anna Kasprzik ZBW Leibniz Information Centre for Economics

Keywords:

subject indexing, automation, machine learning, artificial intelligence, metadata, IT infrastructure

Abstract

Subject indexing, i.e., the enrichment of metadata records for textual resources with descriptors from a controlled vocabulary, is one of the core activities of libraries. Due to the proliferation of digital documents, it is no longer possible to annotate every single document intellectually, which is why we need to explore the potential of automation on every level.

At ZBW, the efforts to partially or completely automate the subject indexing process started as early as 2000 with experiments involving external partners and commercial software. The conclusion of that first exploratory period was that commercial, supposedly shelf-ready solutions would not suffice to cover the requirements of the library. In 2014 the decision was made to start doing the necessary applied research in-house, which was successfully implemented by establishing a PhD position. However, the prototypical machine learning solutions developed over the following years had yet to be integrated into productive operations at the library. Therefore, in 2020 an additional position for a software engineer was established and a pilot phase was initiated (planned to last until 2024) with the goal of completing the transfer of our solutions into practice by building a suitable software architecture that allows for real-time subject indexing with our trained models and integrates it into the other metadata workflows at ZBW.

In this paper we address the question of how to transfer results from applied research into a productive service, and we report on the milestones we have reached so far and on those that are yet to be reached on an operational level. We also discuss the challenges we faced on a strategic level, the measures and resources (computing power, software, personnel) that were needed to effect the transfer, and those that will be necessary to ensure the continued availability of the architecture and to enable continuous development during running operations.

We conclude that there are still no shelf-ready open source systems for the automation of subject indexing: existing software has to be adapted and maintained continuously, which requires various forms of expertise. However, the task of automation is here to stay, and librarians are witnessing the dawn of a new era in which subject indexing is done at least in part by machines, and the respective roles of machines and human experts may shift even further and more rapidly in the not-so-distant future. We argue that, in general, the format of “project” and the mindset that goes with it may not suffice to secure the commitment that an institution, its decision-makers, and the library community as a whole will have to bring to the table in order to face the monumental task of digital transformation and automation in the long run. We also highlight the importance of all parties (applied researchers, software engineers, stakeholders) staying involved and continuously communicating requirements and issues back and forth in order to create and establish a productive service that is suitable and equipped for operation.