Paper Information - Representing Aboutness: Automatically Indexing 19th-Century Encyclopedia Britannica Entries

Representing Aboutness: Automatically Indexing 19th-Century Encyclopedia Britannica Entries

Representing aboutness is a challenge for humanities documents, given the linguistic indeterminacy of the text. The challenge is even greater when applying automatic indexing to historical documents in a multidisciplinary collection, such as an encyclopedia. The research presented in this paper explores this challenge with a comparative automatic indexing study examining topic relevance. The setting is the NEH-funded 19th-Century Knowledge Project, where researchers in the Digital Scholarship Center, Temple University, and the Metadata Research Center, Drexel University, are investigating the best way to index entries across four historical editions of the Encyclopedia Britannica (3rd, 7th, 9th, and 11th editions). Individual encyclopedia entries were processed using the Helping Interdisciplinary Vocabulary Engineering (HIVE) system, a linked-data, automatic indexing terminology application that uses controlled vocabularies. Comparative topic relevance evaluation was performed for three separate keyword extraction algorithms: RAKE, Maui, and Kea++. Results show that RAKE performed best, with an average precision of 67%, compared with 28% for both Maui and Kea++. Additionally, the highest-ranked HIVE results with both RAKE and Kea++ demonstrated relevance across all sample entries, while Maui’s highest-ranked results returned zero relevant terms. This paper reports on background information, research objectives and methods, results, and future research prospects, namely further optimizing RAKE’s algorithm parameters to accommodate encyclopedia entries of different lengths and evaluating the indexing impact of correcting the historical Long S.
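As an illustration of the precision measure reported above, the following minimal sketch shows one way an average precision per algorithm could be computed from relevance judgments. It is not the authors' evaluation code: the function names, the data layout, the algorithm labels (`algo_x`, `algo_y`), and the toy terms are all assumptions made for illustration and do not come from the study.

```python
from statistics import mean


def precision(suggested, relevant):
    """Fraction of suggested terms that were judged relevant.

    Returns 0.0 when no terms were suggested (an assumption of this sketch).
    """
    if not suggested:
        return 0.0
    hits = sum(1 for term in suggested if term in relevant)
    return hits / len(suggested)


def mean_precision(results, judgments):
    """Average the per-entry precision for each algorithm.

    results:   {algorithm: {entry_id: [suggested terms]}}
    judgments: {entry_id: set of terms judged relevant}
    """
    return {
        algorithm: mean(
            precision(terms, judgments.get(entry_id, set()))
            for entry_id, terms in per_entry.items()
        )
        for algorithm, per_entry in results.items()
    }


# Hypothetical toy data, NOT taken from the paper.
judgments = {
    "entry_a": {"astronomy", "planets"},
    "entry_b": {"botany"},
}
results = {
    "algo_x": {"entry_a": ["astronomy", "planets", "music"], "entry_b": ["botany"]},
    "algo_y": {"entry_a": ["music", "law"], "entry_b": ["botany", "law"]},
}

scores = mean_precision(results, judgments)
for algorithm, score in scores.items():
    print(f"{algorithm}: {score:.2f}")
```

Here precision is averaged per entry and then across entries (a macro average); pooling all suggested terms before dividing would be an equally plausible reading of "average precision" and could yield different figures.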
