Paper Information - Scaling Information Extraction to Large Document Collections

Scaling Information Extraction to Large Document Collections

Information extraction and text mining applications are just beginning to tap the immense amounts of valuable textual information available online. To extract information from millions, and in some cases billions, of documents, several approaches to scalability have emerged. We review key approaches for scaling up information extraction, including the use of general-purpose search engines as well as indexing techniques specialized for information extraction applications. Scalable information extraction is an active area of research, and we highlight some of the opportunities and challenges in this area that are relevant to the database community.

Eugene Agichtein
