Exploring the Quality of the Digital Historical Newspaper Archive KubHist

The KubHist corpus is a massive collection of Swedish historical newspapers, digitized by the Royal Swedish Library and available through Korp, the corpus infrastructure of Språkbanken. This paper gives a first overview of the KubHist corpus, explores some of the difficulties with the data, such as OCR errors and spelling variation, and discusses possible paths for improving its quality and searchability.
