References

Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 25(11), 120–125.

Breuel, T. M. (2008). The OCRopus open source OCR system. Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150F. Electronic Imaging 2008, San Jose, California, USA. https://doi.org/10.1117/12.783598

Carrasco, R. C. (2014). An open-source OCR evaluation tool. Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage – DATeCH ’14 (pp. 179–184). https://doi.org/10.1145/2595188.2595221

Hegghammer, T. (2022). OCR with Tesseract, Amazon Textract, and Google Document AI: A benchmarking experiment. Journal of Computational Social Science, 5(1), 861–882. https://doi.org/10.1007/s42001-021-00149-1

Kettunen, K., Koistinen, M., & Kervinen, J. (2020). Ground truth OCR sample data of Finnish historical newspapers and journals in data improvement validation of a re-OCRing process. LIBER Quarterly, 30(1). https://doi.org/10.18352/lq.10322

Kiessling, B. (2019). Kraken - a universal text recognizer for the humanities. Digital Humanities Conference 2019 (DH2019). https://doi.org/10.34894/Z9G2EX

Levenshtein, V. (1965). Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1, 8–17.

Luxemburger Wort. (1942). Neues Kleid. Luxemburger Wort, 2.3.1942(61), 1. https://persist.lu/ark:70795/g3vmw4/pages/1/articles/DTL47

Maurer, Y. (2017). Improving the quality of the text, a pilot project to assess and correct the OCR in a multilingual environment. Relying on News Media. Long Term Preservation and Perspectives for Our Collective Memory. https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-164455

Neudecker, C., Baierer, K., Federbusch, M., Boenig, M., Würzner, K.-M., Hartmann, V., & Herrmann, E. (2019). OCR-D: An end-to-end open source OCR framework for historical printed documents. Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (pp. 53–58). https://doi.org/10.1145/3322905.3322917

Nguyen, T. T. H., Jatowt, A., Nguyen, N.-V., Coustaty, M., & Doucet, A. (2020). Neural machine translation with BERT for post-OCR error detection and correction. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (pp. 333–336). https://doi.org/10.1145/3383583.3398605

Schneider, P. (2021). Combining morphological and histogram based text line segmentation in the OCR Context. Journal of Data Mining & Digital Humanities, 2021 (HistoInformatics). https://doi.org/10.46298/jdmdh.7277

Schneider, P., & Maurer, Y. (2022). Rerunning OCR - A machine learning approach to quality assessment and enhancement prediction. Journal of Data Mining & Digital Humanities. https://doi.org/10.46298/jdmdh.8561

Smith, R. (2007). An overview of the Tesseract OCR engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil (pp. 629–633). https://doi.org/10.1109/icdar.2007.4376991

Soper, E., Fujimoto, S., & Yu, Y.-Y. (2021). BART for post-correction of OCR newspaper text. Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021) (pp. 284–290). https://doi.org/10.18653/v1/2021.wnut-1.31

The Luxembourg Government. (n.d.). The AI4gov initiative. Retrieved November 2, 2022, from https://gouvernement.lu/en/dossiers.gouv_digitalisation%2Ben%2Bdossiers%2B2021%2BAI4Gov.html

Van de Camp, M. (2008). Explorations into unsupervised corpus quality assessment (Doctoral dissertation, Tilburg University, The Netherlands). Retrieved November 9, 2022, from http://ilk.uvt.nl/downloads/pub/papers/hait/camp2008.pdf