Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and many types of entities: newspapers, fiction and historical records on the one hand; persons, locations, chemical compounds, protein families, animals, etc. on the other. In general, a NER system's performance depends on genre and domain, and the entity categories used also vary (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report the first large-scale trials and evaluation of NER with data from Digi, a digitized Finnish historical newspaper collection. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771-1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70-75% (Kettunen and Pääkkönen, 2016). Our principal NER tagger is FiNER, a rule-based tagger for Finnish provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This research reports the first published large-scale results of NER in a historical Finnish OCRed newspaper collection. The results supplement NER results for other languages with similarly noisy data.
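To make the notion of word-level correctness concrete, the following minimal sketch estimates it as the share of word tokens that a reference lexicon recognizes. This is an illustrative proxy only: the function name `word_level_correctness`, the toy lexicon and the sample sentence are hypothetical, and a plain set lookup stands in for whatever lexical resources or analyzers an actual study would use.

```python
import re


def word_level_correctness(text, lexicon):
    """Return the share of word tokens in `text` that appear in `lexicon`.

    Assumption: tokens are maximal runs of Unicode letters, compared
    case-insensitively. Real evaluations would typically use a
    morphological analyzer rather than a flat word list.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if token in lexicon)
    return recognized / len(tokens)


# Hypothetical toy lexicon and an OCR-like sample with two garbled words.
toy_lexicon = {"helsingin", "sanomat", "ilmestyy", "joka", "päivä"}
sample = "Helsingin Sanomat ilmestyy jokn päiwä"

score = word_level_correctness(sample, toy_lexicon)
print(f"Estimated word-level correctness: {score:.0%}")
```

In this toy sample three of five tokens are recognized. A lexicon-based estimate of this kind undercounts correct but rare words (including many proper names), which is one reason it is only a rough indicator for NER input quality.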

[1] Marcia J. Bates, et al. What is browsing - really? A model drawing from behavioural science research, 2007, Inf. Res.

[2] Beatrice Alex, et al. Estimating and rating the quality of optically character recognised text, 2014, DATeCH '14.

[3] Satoshi Sekine, et al. A survey of named entity recognition and classification, 2007.

[4] Clemens Neudecker, et al. Large-scale refinement of digital historic newspapers with named entity recognition, 2014.

[5] Paul Rayson, et al. A semantic tagger for the Finnish language, 2005.

[6] Richard M. Schwartz, et al. Named Entity Extraction from Noisy Input: Speech and OCR, 2000, ANLP.

[7] Elaine Toms, et al. Understanding and facilitating the browsing of electronic text, 2000, Int. J. Hum. Comput. Stud.

[8] Timo Honkela, et al. Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods, 2014.

[9] Sunghwan Mac Kim, et al. Finding Names in Trove: Named Entity Recognition for Australian Historical Newspapers, 2015, ALTA.

[10] Clemens Neudecker, et al. An Open Corpus for Named Entity Recognition in Historic Newspapers, 2016, LREC.

[11] Miikka Silfverberg, et al. Data-Driven Spelling Correction using Weighted Finite-State Methods, 2016, ACL 2016.

[12] Kimmo Kettunen, et al. Measuring Lexical Quality of a Historical Finnish Newspaper Collection - Analysis of Garbled OCR Data with Basic Language Technology Tools and Means, 2016, LREC.

[13] Anthony McEnery, et al. The UCREL Semantic Analysis System, 2004.

[14] Tobias Blanke, et al. Comparison of named entity recognition tools for raw OCR text, 2012, KONVENS.

[15] David W. Embley, et al. Extracting person names from diverse and noisy OCR text, 2010, AND '10.

[16] Eero Hyvönen, et al. Representing and Utilizing Changing Historical Places as an Ontology Time Series, 2011, Geospatial Semantics and the Semantic Web.

[17] Damien Nouvel, et al. Named Entity Resources - Overview and Outlook, 2016, LREC.

[18] Daniel P. Lopresti, et al. Optical character recognition errors and their effects on natural language processing, 2008, AND '08.

[19] Lars Borin, et al. HFST-SweNER - A New NER Resource for Swedish, 2014, LREC.

[20] Mónica Marrero, et al. Named Entity Recognition: Fallacies, challenges and opportunities, 2013, Comput. Stand. Interfaces.

[21] Thierry Poibeau, et al. Proper Name Extraction from Non-Journalistic Texts, 2000, CLIN.

[22] Tommi A. Pirinen, et al. HFST - A System for Creating NLP Tools, 2013, SFCM.

[23] Kimmo Kettunen, et al. Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach, 2016, J. Assoc. Inf. Sci. Technol.

[24] Eetu Mäkelä. Combining a REST Lexical Analysis Web Service with SPARQL for Mashup Semantic Annotation from Text, 2014, ESWC.

[25] Kimmo Kettunen, et al. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use, 2016, D-Lib Mag.

[26] Juan Trujillo, et al. Current state of Linked Data in digital libraries, 2016, J. Inf. Sci.

[27] Eero Hyvönen, et al. Contextualizing Historical Places in a Gazetteer by Using Historical Maps and Linked Data, 2016, DH.

[28] Sven Laur, et al. Named Entity Recognition in Estonian, 2013, BSNLP@ACL.

[29] Dawn Archer, et al. Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages, 2016, LREC.

[30] Gregory R. Crane, et al. The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection, 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[31] Christine D. Piatko, et al. Processing Named Entities in Text, 2011.

[32] Kimmo Kettunen, et al. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers, 2016, Baltic HLT.