Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially for the oldest parts of the collection. Approaches such as crowdsourcing have been used in this field to improve the quality of the texts, but in this case the volume of the material makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements.
