Querying text databases for efficient information extraction

A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is well suited to sophisticated query processing, integration with relational databases, and data mining. Current information extraction techniques extract relations from a text database either by examining every document in the database or by using filters to select promising documents for extraction. Exhaustive scanning is impractical, or even infeasible, for large databases, and current filtering techniques require human involvement to maintain and to adapt to new databases and domains. We develop an automatic query-based technique for retrieving documents useful for the extraction of user-defined relations from large text databases; the technique can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive, which shows that focusing only on promising documents significantly improves the efficiency of the extraction process.
