Extended Named Entities Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign

Within the framework of the Quaero project, we proposed a new definition of named entities that extends both their coverage and their structure. Under this definition, extended named entities are both hierarchical and compositional. In this paper, we focus on the annotation of a corpus of press archives, OCRed from French newspapers of December 1890. We present the methodology used to produce the corpus and its characteristics in terms of named entity annotation. The annotated corpus was used in an evaluation campaign; we describe this evaluation, the metrics used, and the results obtained by the participants.
