Gleaning Information from the Web: Using Syntax to Filter Out Irrelevant Information

In this paper, we describe a system called Glean, which is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of Information Retrieval (IR) systems. We propose an approach to information retrieval that makes use of syntactic information obtained using a tool called a supertagger. The supertagger is run on a corpus of training material to semi-automatically induce patterns that we call augmented patterns. We show how these augmented patterns may be used along with a standard Web search engine or an IR system to retrieve information, identify relevant items, and filter out irrelevant ones. We describe an experiment in the domain of official appointments, where such patterns are shown to reduce the number of potentially irrelevant documents by upwards of 80%.