Diachronic Evaluation of NER Systems on Old Newspapers

In recent years, many cultural institutions have undertaken large-scale newspaper digitization projects, and large amounts of historical text are being acquired (via transcription or OCR). Beyond document preservation, the next step is to provide enhanced access to the content of these digital resources. In this regard, the processing of units that act as referential anchors, namely named entities (NE), is of particular importance. Yet applying standard NE tools to historical texts raises several challenges, and performance is often lower than on contemporary documents. This paper investigates the performance of different NE recognition tools applied to old newspapers by conducting a diachronic evaluation over seven time series taken from the archives of the Swiss newspaper Le Temps.
