Paper Information - Glean: Using Syntactic Information in Document Filtering

Glean: Using Syntactic Information in Document Filtering

In this paper, we describe a system called Glean, which is based on the idea that coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of information retrieval systems. We propose an approach to increase the precision of information retrieval that makes use of syntactic information obtained using a supertagger. In this approach, patterns based on local syntactic context are induced from training material. These patterns are used to refine the set of documents retrieved by a standard Web search engine or an information retrieval system, by selecting relevant information and filtering out irrelevant items. We show that syntactic information does improve the effectiveness of filtering irrelevant documents, and that supertagging is more effective than part-of-speech tagging in filtering documents. Further, we also show how the extent of syntactic context affects filtering performance. We discuss the relationship between Glean and other attempts at improving information retrieval performance.
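The idea of inducing patterns from local syntactic context and using them to filter retrieved text can be illustrated with a small sketch. This is not the Glean implementation: the function names, the toy tag labels (`N`, `AUX`, `V`, `D`, `A`, standing in for supertags), the keyword, the training sentences, and the rule "keep a pattern only if it occurs in relevant but never in irrelevant training material" are all assumptions made for illustration. The `window` parameter stands in for the extent of syntactic context mentioned in the abstract.

```python
from collections import Counter

def context_patterns(tagged, keyword, window=1):
    """Return the tag sequences surrounding each occurrence of keyword.

    tagged is a list of (word, tag) pairs; the tags here are toy labels
    standing in for supertags (an assumption of this sketch).
    """
    patterns = []
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() == keyword:
            lo = max(0, i - window)
            hi = min(len(tagged), i + window + 1)
            patterns.append(tuple(tag for _w, tag in tagged[lo:hi]))
    return patterns

def induce_patterns(training, keyword, window=1):
    """Keep patterns seen in relevant examples and never in irrelevant ones.

    training is a list of (tagged_sentence, label) pairs, label being
    True for relevant material. The selection rule is hypothetical.
    """
    relevant, irrelevant = Counter(), Counter()
    for tagged, label in training:
        target = relevant if label else irrelevant
        target.update(context_patterns(tagged, keyword, window))
    return {p for p in relevant if p not in irrelevant}

def keep_document(tagged, keyword, patterns, window=1):
    """Return True if any keyword context matches an induced pattern."""
    return any(p in patterns for p in context_patterns(tagged, keyword, window))

# Hypothetical training material for the keyword "appointed".
TRAINING = [
    ([("Smith", "N"), ("was", "AUX"), ("appointed", "V"), ("chairman", "N")], True),
    ([("the", "D"), ("appointed", "A"), ("hour", "N")], False),
]
PATTERNS = induce_patterns(TRAINING, "appointed", window=1)

candidate = [("Jones", "N"), ("was", "AUX"), ("appointed", "V"), ("director", "N")]
noise = [("an", "D"), ("appointed", "A"), ("time", "N")]
print(keep_document(candidate, "appointed", PATTERNS))
print(keep_document(noise, "appointed", PATTERNS))
```

Widening `window` makes each pattern more specific, which in this toy setting would tend to trade recall for precision; how context size affects real filtering performance is what the paper reports on.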

Raman Chandrasekar | Srinivas Bangalore

[1] Srinivas Bangalore, et al. Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing, Institute for Research in Cognitive Science, 1995.

[2] Srinivas Bangalore. Using Supertags in Document Filtering: the Effect of Increased Context on Information Retrieval Effectiveness, 1997.

[3] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[4] Harold R. Robison. Computer-detectable semantic structures, 1970, Inf. Storage Retr.

[5] Gerard Salton, et al. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, 1989.

[6] Boris Katz, et al. Annotating the World Wide Web using Natural Language, 1997, RIAO.

[7] Raman Chandrasekar, et al. Gleaning Information from the Web: Using Syntax to Filter Out Irrelevant Information, 1996.

[8] Beth Ann Hockey, et al. XTAG System - A Wide Coverage Grammar for English, 1994, COLING.

[9] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[10] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.

[11] Shivakumar Vaithyanathan, et al. Exploiting clustering and phrases for context-based information retrieval, 1997, SIGIR '97.