Exploring entity recognition and disambiguation for cultural heritage collections

Unstructured metadata fields such as ‘description’ offer tremendous value to users seeking to understand cultural heritage objects. However, because it is unstructured, this type of narrative information is of little direct use in a machine-readable context. This article explores the possibilities and limitations of named-entity recognition (NER) and term extraction (TE) for mining such unstructured metadata for meaningful concepts. These concepts can enhance otherwise limited search and browse operations, and they can also play an important role in fostering Digital Humanities research. To catalyze experimentation with NER and TE, the article evaluates the performance of three third-party entity extraction services through a comprehensive case study based on the descriptive fields of the Smithsonian Cooper–Hewitt National Design Museum in New York. To cover both NER and TE, we first offer a quantitative analysis of the named entities retrieved by the services, measured in terms of precision and recall against a manually annotated gold-standard corpus, and then complement this approach with a more qualitative assessment of the relevant terms extracted. Based on the outcomes of this twofold analysis, the conclusions present the added value of entity extraction services, but also indicate the dangers of uncritically using NER and/or TE, and by extension Linked Data principles, within the Digital Humanities. All metadata and tools used in the article are freely available, making it possible for researchers and practitioners to repeat the methodology. In this way, the article contributes to understanding the value of entity recognition and disambiguation for the Digital Humanities.
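To make the quantitative part of the evaluation concrete, the following is a minimal sketch of how precision and recall against a gold-standard annotation might be computed. It assumes exact matching on (surface form, entity type) pairs; the function name, the sample entities, and the matching criterion are illustrative assumptions and are not taken from the article, whose own matching rules may differ.

```python
from typing import Set, Tuple

# An entity is modelled here as (surface form, entity type).
# This representation is an assumption made for illustration only.
Entity = Tuple[str, str]


def precision_recall_f1(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Score a set of extracted entities against a gold-standard set.

    An extracted entity counts as correct only if both its surface form
    and its type match a gold-standard entity exactly.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for a single description field.
gold = {
    ("New York", "LOCATION"),
    ("Paris", "LOCATION"),
    ("Tiffany", "PERSON"),
    ("Cooper-Hewitt", "ORGANIZATION"),
}
predicted = {
    ("New York", "LOCATION"),
    ("Paris", "LOCATION"),
    ("Tiffany", "ORGANIZATION"),  # right string, wrong type: counted as an error
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case two of the three extracted entities are correct and two of the four gold-standard entities are found. A looser criterion, such as matching on surface form only, would score the same output more generously, which is one reason the matching rule needs to be stated alongside any reported figures.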
