Scaling Information Extraction to Large Document Collections

Information extraction and text mining applications are just beginning to tap the immense amounts of valuable textual information available online. To extract information from millions, and in some cases billions, of documents, different solutions to scalability have emerged. We review key approaches for scaling up information extraction, including the use of general-purpose search engines as well as indexing techniques specialized for information extraction applications. Scalable information extraction is an active area of research, and we highlight some of the opportunities and challenges in this area that are relevant to the database community.
