Indexing Text Documents Based on Topic Identification

This work provides algorithms and heuristics to index text documents by determining the important topics in them. To index text documents, the work provides algorithms to generate topic candidates, determine their importance, detect similar and synonymous topics, and eliminate incoherent topics. The indexing algorithm uses topic frequency to determine both the existence and the importance of topics. Repeated phrases are topic candidates. For example, since the phrase 'index text documents' occurs three times in this abstract, the phrase is one of the topics of this abstract. The method is shown to be more effective than either a simple word count model or approaches based on term weighting.
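
The idea that repeated phrases serve as topic candidates can be illustrated with a minimal sketch. This is not the algorithm from the work itself: the function name `repeated_phrases`, the fixed phrase length `n`, the lowercase alphanumeric tokenization, and the `min_count` threshold are all assumptions chosen for illustration, and the sketch does not cover synonym detection or the elimination of incoherent topics.

```python
import re
from collections import Counter


def repeated_phrases(text, n=3, min_count=2):
    """Return {phrase: count} for word n-grams that occur at least min_count times.

    Hypothetical helper: tokenization and thresholds are illustrative assumptions,
    not the method described in the abstract.
    """
    # Lowercase the text and keep only alphanumeric runs as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Slide a window of n words across the text to form candidate phrases.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(ngrams)
    # Keep only phrases that repeat; frequency stands in for importance.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


sample = "To index text documents we index text documents; index text documents."
print(repeated_phrases(sample))  # {'index text documents': 3}
```

Here frequency alone decides which phrases survive. A fuller treatment would presumably vary the phrase length and merge similar candidates, as the abstract describes.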
