Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process

Authors

  • Kimmo Kettunen, University of Helsinki, National Library of Finland, FI
  • Mika Koistinen, University of Helsinki, National Library of Finland, FI
  • Jukka Kervinen, University of Helsinki, FI

DOI:

https://doi.org/10.18352/lq.10322

Keywords:

OCR quality, Finnish historical newspapers, measurement, evaluation, ground truth data

Abstract

Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 16.51 million pages, mainly in Finnish and Swedish. Of these, about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the years 1771 to 1929; the last nine years, 1921–1929, were opened in January 2018.

This paper briefly presents the ground truth Optical Character Recognition (OCR) data of about 500 000 words that has been compiled at the NLF for the development of an improved OCR process for the Finnish collection. We discuss the compilation of the data in general terms and show results of the new OCR process in comparison to the current OCR, using the ground truth data as an evaluation benchmark. We also show, using 30 years of real newspaper data comprising 109 million words, that the re-OCRing process improves the quality of the OCRed data.

Published

2020-02-04

How to Cite

Kettunen, K., Koistinen, M., & Kervinen, J. (2020). Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process. LIBER Quarterly: The Journal of the Association of European Research Libraries, 30(1), 1–20. https://doi.org/10.18352/lq.10322

Section

Case studies
Received 2021-06-16