Impact of OCR Quality on Named Entity Linking

Digital libraries are online collections of digital objects that can include text, images, audio, or video. It has long been observed that named entities (NEs) are key to accessing digital library portals, as they appear in most user queries. Combined with or subsequent to the recognition of NEs, named entity linking (NEL) connects NEs to external knowledge bases. This makes it possible to disambiguate geographical locations or personal names (e.g., John Smith), and means that the descriptions from the knowledge bases can be used for semantic enrichment. However, the NEL task is especially challenging for large quantities of documents, as the diversity of NEs increases with the size of the collections. Additionally, digitized documents are indexed through their OCRed versions, which may contain numerous OCR errors. This paper aims to evaluate the performance of named entity linking over digitized documents with different levels of OCR quality. To our knowledge, it is the first investigation to analyze how document degradation affects, and correlates with, the performance of NEL. We tested state-of-the-art NEL techniques over several evaluation benchmarks and experimented with various types of OCR noise. We present the resulting study and the ensuing recommendations on the adequate documents and OCR quality levels required for reliable named entity linking. We further provide the first evaluation benchmark for NEL over degraded documents.
