Glean: Using Syntactic Information in Document Filtering

In this paper, we describe a system called Glean, which is based on the idea that coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of information retrieval systems. We propose an approach to increase the precision of information retrieval that makes use of syntactic information obtained using a supertagger. In this approach, patterns based on local syntactic context are induced from training material. These patterns are used to refine the set of documents retrieved by a standard Web search engine or an information retrieval system, by selecting relevant information and filtering out irrelevant items. We show that syntactic information does improve the effectiveness of filtering irrelevant documents, and that supertagging is more effective than part-of-speech tagging in filtering documents. Further, we also show how the extent of syntactic context affects filtering performance. We discuss the relationship between Glean and other attempts at improving information retrieval performance.
