KINDEX – Automatic Subject Indexing with Knowledge Graph Services

Automatic subject indexing has been a longstanding goal of digital curators, as it facilitates effective retrieval of large collections of both online and offline information resources. Controlled vocabularies are often used for this purpose because they standardise annotation practices and help users navigate online resources by following interlinked topical concepts. To date, however, the assignment of suitable text annotations from a controlled vocabulary is still largely done manually, or at most semi-automatically, even though effective machine learning tools are already available. The reason is that existing procedures require a sufficient amount of training data and have to be adapted anew to each vocabulary, language and application domain. Given tight budgets and a shortage of IT personnel in cultural heritage and research infrastructure institutions, the adoption of automatic subject annotation tools is hindered, while manual assignment of index terms places an even greater burden on the available financial resources. In this paper, we argue that there is a third solution to subject indexing, which harnesses cross-domain knowledge graphs (i.e., DBpedia and Wikidata) to enable cost-effective automatic descriptor assignment without any algorithm tuning or training corpora. Our KINDEX approach fuses distributed knowledge graph information from different sources. Experimental evaluation shows that the approach achieves good accuracy scores by exploiting the correspondence links of publicly available knowledge graphs.
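The idea of fusing entity annotations from several knowledge graphs through correspondence links can be illustrated with a minimal sketch. It is not the KINDEX implementation: the function name `fuse_descriptors`, the voting rule, the `min_votes` parameter and all identifiers in the toy data are assumptions made for illustration only. The sketch assumes that entities have already been detected in a text by some annotation service and that each knowledge graph offers links from its entities to descriptors of a target vocabulary.

```python
from collections import Counter

def fuse_descriptors(entities_by_source, links_by_source, min_votes=1):
    """Map detected entities to vocabulary descriptors and fuse the results.

    entities_by_source: source name -> list of entity identifiers found in a text
    links_by_source:    source name -> {entity identifier: [descriptor identifiers]}
    min_votes:          number of sources that must agree on a descriptor (assumed rule)
    """
    votes = Counter()
    for source, entities in entities_by_source.items():
        links = links_by_source.get(source, {})
        # Collect each descriptor once per source so a source casts one vote at most.
        descriptors = set()
        for entity in entities:
            descriptors.update(links.get(entity, []))
        votes.update(descriptors)
    selected = [d for d, n in votes.items() if n >= min_votes]
    # Most widely supported descriptors first, ties broken alphabetically.
    return sorted(selected, key=lambda d: (-votes[d], d))

# Hypothetical toy data; none of these identifiers are real URIs.
entities = {
    "dbpedia": ["dbpedia:Inflation", "dbpedia:Monetary_policy"],
    "wikidata": ["wikidata:EXAMPLE_INFLATION"],
}
links = {
    "dbpedia": {
        "dbpedia:Inflation": ["vocab:inflation"],
        "dbpedia:Monetary_policy": ["vocab:monetary-policy"],
    },
    "wikidata": {
        "wikidata:EXAMPLE_INFLATION": ["vocab:inflation"],
    },
}

all_descriptors = fuse_descriptors(entities, links)
agreed_descriptors = fuse_descriptors(entities, links, min_votes=2)
```

In this sketch a descriptor is assigned when enough sources link to it; raising `min_votes` trades coverage for agreement between sources. No training data is involved, which mirrors the motivation stated in the abstract, although the actual fusion strategy of the paper may differ.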
