Phrasal Terms in Real-World IR Applications

In this chapter we report our investigation of an important issue in real-world IR environments: the usefulness, extraction, and use of phrasal terms. A large-scale empirical study provides evidence that phrasal terms can improve retrieval effectiveness, especially when their relative proximity information is derived from naturally running text. To identify significant terms for a predefined topic automatically, we adopted a “gaining data from data” approach: the algorithm learns to select candidate terms by comparing a focused sample with a large and diverse base sample. To investigate whether the identified terms are useful in other IR applications, we applied them as knowledge resources for document summarization and classification. The initial results are promising.
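The comparison of a focused sample with a base sample can be illustrated with a minimal sketch. The abstract does not say which statistic the algorithm uses, so everything below is an assumption made for illustration: candidate terms are taken to be adjacent word pairs, the score is a smoothed log ratio of relative frequencies, and the names `bigrams`, `score_terms`, and `min_count` are hypothetical.

```python
from collections import Counter
import math


def bigrams(text):
    """Split text on whitespace and return adjacent word pairs (assumed candidate terms)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def score_terms(focused_docs, base_docs, min_count=2):
    """Rank bigrams by how much more frequent they are in the focused sample.

    Assumption: a smoothed log frequency ratio stands in for the
    comparison that the chapter describes only in general terms.
    """
    focused = Counter(bg for doc in focused_docs for bg in bigrams(doc))
    base = Counter(bg for doc in base_docs for bg in bigrams(doc))
    focused_total = sum(focused.values())
    base_total = sum(base.values())
    vocab_size = len(set(focused) | set(base))

    scores = {}
    for term, count in focused.items():
        if count < min_count:
            continue  # skip terms too rare in the focused sample to judge
        # Add-one smoothing keeps the ratio finite for terms absent from the base sample.
        p_focused = (count + 1) / (focused_total + vocab_size)
        p_base = (base[term] + 1) / (base_total + vocab_size)
        scores[" ".join(term)] = math.log(p_focused / p_base)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Called as `score_terms(topic_docs, background_docs)`, the function returns `(term, score)` pairs with the most topic-specific candidates first. A larger and more diverse base sample makes the denominator more reliable, which is presumably why the approach relies on one.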
