Document indexing: a concept-based approach to term weight estimation

Traditional index weighting approaches for information retrieval from texts depend on term-frequency-based analysis of the text contents. Because these indexing schemes consider only the occurrences of terms in a document, they are limited in their ability to extract index terms that accurately represent the document's semantic content. To address this issue, we developed a new indexing formalism that considers not only the terms in a document but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. In an experiment on the TREC collection of Wall Street Journal documents, the proposed method outperforms an indexing method based on term frequency (TF), especially for the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce document search time in a real environment.
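
The contrast between a TF index and a concept-level index can be illustrated with a minimal sketch. This is not the authors' formalism: the `CONCEPT_OF` map, the `tf_index` and `concept_index` helpers, and the pooled-count weighting are all assumptions made for illustration. In particular, the clusters here are written by hand rather than derived from the text.

```python
from collections import Counter

# Hypothetical, hand-made concept clusters (an assumption for illustration;
# the paper defines its own concept clusters, which are not reproduced here).
CONCEPT_OF = {
    "stock": "finance", "share": "finance", "market": "finance", "investor": "finance",
    "car": "vehicle", "truck": "vehicle", "automobile": "vehicle",
}


def tf_index(tokens):
    """Term-frequency index: one dimension per distinct term."""
    return Counter(tokens)


def concept_index(tokens):
    """Concept-level index: pool term counts per concept cluster and
    normalize so the weights sum to 1. Terms outside every cluster are dropped."""
    weights = Counter()
    for tok in tokens:
        concept = CONCEPT_OF.get(tok)
        if concept is not None:
            weights[concept] += 1
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()} if total else {}


doc = ("investor sold stock and share as the market fell "
       "while car and truck sales rose").split()

tf = tf_index(doc)        # keyed by surface terms
ci = concept_index(doc)   # keyed by concepts

print(len(tf), "term dimensions vs", len(ci), "concept dimensions")
print(ci)
```

In this toy document the concept-level index has far fewer dimensions than the TF index, which is the kind of reduction the abstract reports. The actual figures depend entirely on how the clusters are built.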
