Automatic Subject Cataloguing at the German National Library
DOI:
https://doi.org/10.53377/lq.19422
Keywords:
German National Library; automatic classification; automatic indexing; natural language processing; machine learning
Abstract
The German National Library (DNB) began developing solutions for automatic subject cataloguing 15 years ago. The main driver was the large and ever-growing number of digital media works that needed to be indexed. Today, the DNB uses open source algorithms and frameworks to assign several types of thematic metadata automatically.
This practice paper provides deeper insight into automatic subject cataloguing at the DNB. We look at the data and vocabularies used as well as the different methods and approaches. The vocabulary for classification is based on the Dewey Decimal Classification (DDC). For verbal subject indexing, we use the German Integrated Authority File (GND).
The use case of automatic classification is divided into the assignment of DDC Subject Categories and DDC Short Numbers. Due to the large size of the GND vocabulary, the use case of automatic indexing is an extreme multi-label classification (XMLC) problem. We briefly report on the construction and performance of our models.
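To make the multi-label setting concrete, the following minimal Python sketch selects a few subject labels from a set of model scores and compares them with an intellectually assigned gold standard using precision, recall and F1. It is an illustration only, not the DNB's implementation: the function names, the `limit` and `threshold` parameters, the scores and the labels are all hypothetical, and a real GND-scale vocabulary is far larger than this toy example.

```python
from typing import Dict, List, Set, Tuple


def suggest_labels(scores: Dict[str, float], limit: int = 5,
                   threshold: float = 0.2) -> List[str]:
    """Return up to `limit` labels scoring at least `threshold`, best first."""
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, score in ranked[:limit] if score >= threshold]


def precision_recall_f1(suggested: List[str],
                        gold: Set[str]) -> Tuple[float, float, float]:
    """Compare suggested labels with the gold-standard labels of one document."""
    hits = len(set(suggested) & gold)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical scores for one document (illustrative labels, not GND records).
scores = {
    "Machine learning": 0.91,
    "Library": 0.64,
    "Indexing": 0.35,
    "Botany": 0.05,
}
gold = {"Machine learning", "Indexing", "Classification"}

suggested = suggest_labels(scores, limit=3, threshold=0.2)
precision, recall, f1 = precision_recall_f1(suggested, gold)
print(suggested, round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case two of the three suggested labels match the gold standard, so precision, recall and F1 all equal 2/3. Raising the threshold or lowering the limit typically trades recall for precision.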
Based on these use cases, we present some implementation aspects of our “subject cataloguing machine” EMa, the environment for automatic subject cataloguing in productive use. We outline the basic feature set and give a high-level introduction to the productive EMa system. We also describe the modular design of the EMa software architecture, which uses the open source software Annif as its central toolkit.
The development of EMa is an ongoing task at the DNB. It requires continuous maintenance as well as technological and human resources. Applied research activities in the DNB's AI project are closely linked to EMa, ensuring that relevant scientific findings are integrated into its development.
License
Copyright (c) 2025 Christoph Poley, Sandro Uhlmann, Frank Busse, Jan-Helge Jacobs, Maximilian Kähler, Matthias Nagelschmidt, Markus Schumacher

This work is licensed under a Creative Commons Attribution 4.0 International License.