Entity Linking for Historical Documents: Challenges and Solutions

Named entities (NEs) are among the most relevant types of information that can be used to efficiently index and retrieve digital documents. Furthermore, using Entity Linking (EL) to disambiguate NEs and relate them to knowledge bases provides supplementary information that can help differentiate ambiguous elements such as geographical locations and people's names. In historical documents, the detection and disambiguation of NEs is a challenge. Most historical documents are converted into plain text using an optical character recognition (OCR) system, at the expense of some noise. Documents in digital libraries will therefore be indexed with errors that may hinder their accessibility. OCR errors affect not only document indexing but also the detection, disambiguation, and linking of NEs. This paper analyses the performance of different EL approaches on two multilingual historical corpora, CLEF HIPE 2020 (English, French, German) and NewsEye (Finnish, French, German, Swedish), and proposes several techniques for alleviating the impact of historical data problems on the EL task. Our findings indicate that the proposed approaches not only outperform the baseline on both corpora but also considerably reduce the impact of historical document issues across different subjects and languages.
