Toward effective automated weighted subject indexing: A comparison of different approaches in different environments

Subject indexing plays an important role in supporting subject access to information resources. Current subject indexing systems, however, do not adequately distinguish the importance of the subject descriptors assigned to a document. Assigning numeric weights to subject descriptors to indicate their importance to the document can strengthen the role of subject metadata, and automated methods are more cost-effective than manual assignment. This study compares different automated weighting methods in different environments, using two evaluation methods to assess performance. Experiments on three datasets in the biomedical domain suggest that the performance of a weighting method depends on whether the environment is abstract or full text. Mutual information with bag-of-words representation shows the best average performance in the full-text environment, while cosine with bag-of-words representation performs best in the abstract environment. The cosine measure performs relatively consistently and robustly. A direct weighting method, IDF (inverse document frequency), can produce quick and reasonable estimates of the weights. Bag-of-words representation generally outperforms concept-based representation. Performance can be improved further by using a learning-to-rank method to integrate the different weighting methods. This study follows up on Lu and Mao (Journal of the Association for Information Science and Technology, 66, 1776–1784, 2015), in which an automated weighted subject indexing method was proposed and validated. The findings contribute to more effective weighted subject indexing.
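As a rough illustration of two of the weighting ideas named in the abstract, the sketch below computes an IDF weight and a bag-of-words cosine weight for the descriptors of a toy collection. It is not the authors' implementation. The function names, the toy data, and the choice to build each descriptor's profile by pooling the words of all documents assigned to it are assumptions made for the example.

```python
import math
from collections import Counter


def idf_weights(doc_descriptors):
    """Direct weighting: IDF of each descriptor, log(N / document frequency)."""
    n_docs = len(doc_descriptors)
    df = Counter(d for descs in doc_descriptors for d in set(descs))
    return {d: math.log(n_docs / count) for d, count in df.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_weights(doc_tokens, doc_descriptors, doc_index):
    """Weight each descriptor of one document by document-profile cosine.

    Assumption: a descriptor's bag-of-words profile is the pooled term
    counts of every document it is assigned to.
    """
    profiles = {}
    for tokens, descs in zip(doc_tokens, doc_descriptors):
        for d in set(descs):
            profiles.setdefault(d, Counter()).update(tokens)
    doc_vec = Counter(doc_tokens[doc_index])
    return {d: cosine(doc_vec, profiles[d]) for d in doc_descriptors[doc_index]}


# Hypothetical toy collection: tokenized text and assigned descriptors.
docs = [
    ["insulin", "glucose", "insulin", "therapy"],
    ["glucose", "diet", "exercise"],
    ["heart", "exercise", "therapy"],
]
descs = [["Insulin", "Diabetes"], ["Diabetes", "Diet"], ["Heart Diseases"]]

idf = idf_weights(descs)
weights_doc0 = cosine_weights(docs, descs, 0)
```

In this toy collection both measures rank "Insulin" above "Diabetes" for the first document: "Insulin" is the rarer descriptor, and its profile is closer to that document's own wording.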

[1]  Kun Lu, et al.  Automatically infer subject terms and documents associations through text mining, 2013, ASIST.

[2]  Djoerd Hiemstra, et al.  Using language models for information retrieval, 2001.

[3]  Matthew Hurst, et al.  A Language Model Approach to Keyphrase Extraction, 2003, ACL 2003.

[4]  Birger Hjørland, et al.  Subject (of Documents), 2017.

[5]  Ted Dunning, et al.  Accurate Methods for the Statistics of Surprise and Coincidence, 1993, CL.

[6]  Kenneth Ward Church, et al.  Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.

[7]  Birger Hjørland, et al.  The importance of theories of knowledge: Indexing and information retrieval as an example, 2011, J. Assoc. Inf. Sci. Technol.

[8]  Gerard Salton, et al.  A vector space model for automatic indexing, 1975, CACM.

[9]  Leo Egghe, et al.  An approach to similarity measurement of absence-presence data: the case that common zeros matter, 2004, J. Inf. Sci.

[10]  William S. Cooper, et al.  Foundations of Probabilistic and Utility-Theoretic Indexing, 1978, JACM.

[11]  Ari Cohen.  Subject Analysis, 2000.

[12]  Dolf Trieschnigg, et al.  Measuring concept relatedness using language models, 2008, SIGIR '08.

[13]  Alan R. Aronson, et al.  An overview of MetaMap: historical perspective and recent advances, 2010, J. Am. Medical Informatics Assoc.

[14]  Gerard Salton, et al.  A theory of indexing, 1975, Regional conference series in applied mathematics.

[15]  Birger Hjørland, et al.  The Concept of 'subject' in Information Science, 1992, J. Documentation.

[16]  Virginia A. Lingle, et al.  Indexing and Abstracting in Theory and Practice, 2005.

[17]  Alan Gilchrist, et al.  Thesaurus construction: a practical manual, 1972.

[18]  Ina Fourie.  Powering Search: The Role of Thesauri in New Information Environments, 2014.

[19]  W. Bruce Croft, et al.  Statistical language modeling for information retrieval, 2006, Annu. Rev. Inf. Sci. Technol.

[20]  Linda C. Smith, et al.  Seeing the Wood for the Trees: Enhancing Metadata Subject Elements with Weights, 2011.

[21]  Kun Lu, et al.  Mining document, concept, and term associations for effective biomedical retrieval: introducing MeSH-enhanced retrieval models, 2015, Information Retrieval Journal.

[22]  Derek Wilton Langridge.  Subject Analysis: Principles and Procedures, 1989.

[23]  Kun Lu, et al.  An automatic approach to weighted subject indexing—an empirical study in the biomedical domain, 2015, J. Assoc. Inf. Sci. Technol.

[24]  Tie-Yan Liu, et al.  Learning to rank for information retrieval, 2009, SIGIR.

[25]  Sarah Morgan.  Powering Search: The Role of Thesauri in New Information Environments, 2013.

[26]  Christopher D. Manning, et al.  Introduction to Information Retrieval, 2010, J. Assoc. Inf. Sci. Technol.

[27]  Birger Hjørland, et al.  Knowledge Organization, 2005.

[28]  Kun Lu, et al.  Understanding the retrieval effectiveness of collaborative tags and author keywords in different retrieval environments: An experimental study on medical collections, 2014, J. Assoc. Inf. Sci. Technol.

[29]  Birger Hjørland.  What is Knowledge Organization (KO), 2008.

[30]  Alan R. Aronson, et al.  Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, 2001, AMIA.

[31]  P. Wilson.  Two kinds of power: an essay on bibliographical control, 1978.

[32]  Cyril W. Cleverdon, et al.  The significance of the Cranfield tests on index languages, 1991, SIGIR '91.

[33]  Leo Breiman, et al.  Random Forests, 2001, Machine Learning.

[34]  Arlene G. Taylor, et al.  The Organization of Information, 1999.

[35]  M. E. Maron, et al.  On indexing, retrieval and the meaning of about, 1977, J. Am. Soc. Inf. Sci.

[36]  Flemming Topsøe, et al.  Jensen-Shannon divergence and Hilbert space embedding, 2004, International Symposium on Information Theory, 2004. ISIT 2004. Proceedings.

[37]  Olivier Bodenreider, et al.  The NLM Indexing Initiative, 2000, AMIA.

[38]  Raya Fidel, et al.  User-Centered Indexing, 1994, J. Am. Soc. Inf. Sci.

[39]  M. E. Maron, et al.  On Relevance, Probabilistic Indexing and Information Retrieval, 1960, JACM.

[40]  Kun Lu, et al.  Enhancing Subject Metadata with Automated Weighting in the Medical Domain: A Comparison of Different Measures, 2015, ICADL.