Paper information - Gleaning Information from the Web: Using Syntax to Filter Out Irrelevant Information

In this paper, we describe a system called Glean, which is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of Information Retrieval (IR) systems. We propose an approach to information retrieval that makes use of syntactic information obtained using a tool called a supertagger. The supertagger is run on a corpus of training material to semi-automatically induce patterns that we call augmented patterns. We show how these augmented patterns may be used along with a standard Web search engine or an IR system to retrieve information, identify relevant items, and filter out irrelevant ones. We describe an experiment in the domain of official appointments, where such patterns are shown to reduce the number of potentially irrelevant documents by upwards of 80%.
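The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the Glean implementation: the tag names, the pattern representation (a sequence of optional word and tag constraints), and the function names below are all assumptions made for illustration, and the coarse tags stand in for the much richer supertags the paper relies on.

```python
from typing import Dict, List, Optional, Tuple

# A token is a (word, tag) pair. The tags are hypothetical stand-ins for supertags.
Token = Tuple[str, str]
# A pattern element constrains the word, the tag, or both; None means "match anything".
PatternElem = Tuple[Optional[str], Optional[str]]
Pattern = List[PatternElem]


def elem_matches(elem: PatternElem, token: Token) -> bool:
    """Check one pattern element against one tagged token."""
    word, tag = elem
    return (word is None or word == token[0].lower()) and (
        tag is None or tag == token[1]
    )


def sentence_matches(pattern: Pattern, sentence: List[Token]) -> bool:
    """True if the pattern matches some contiguous window of the sentence."""
    n = len(pattern)
    return any(
        all(elem_matches(e, t) for e, t in zip(pattern, sentence[i : i + n]))
        for i in range(len(sentence) - n + 1)
    )


def filter_documents(
    patterns: List[Pattern], documents: Dict[str, List[List[Token]]]
) -> List[str]:
    """Keep the names of documents with at least one sentence matching any pattern."""
    return [
        name
        for name, sentences in documents.items()
        if any(sentence_matches(p, s) for p in patterns for s in sentences)
    ]


# Hypothetical pattern for the appointments domain: a noun phrase, then "was appointed".
PATTERNS: List[Pattern] = [[(None, "NP"), ("was", None), ("appointed", None)]]

# Hypothetical, hand-tagged documents (as if returned by a search engine, then tagged).
DOCS: Dict[str, List[List[Token]]] = {
    "relevant": [[("Smith", "NP"), ("was", "AUX"), ("appointed", "V"), ("chairman", "N")]],
    "irrelevant": [[("The", "DET"), ("appointed", "ADJ"), ("time", "N"), ("passed", "V")]],
}

kept = filter_documents(PATTERNS, DOCS)
```

Both sample documents contain the keyword "appointed", so a keyword query alone would return both; only the first satisfies the syntactic constraint, which is the kind of distinction the abstract attributes to augmented patterns.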

Raman Chandrasekar | B. Srinivas
