Scalable Attribute-Value Extraction from Semi-structured Text

This paper describes a general methodology for extracting attribute-value pairs from web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated, and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can process 1 billion web pages in less than 6 hours on 1,000 machines. The best generator and filter combination achieves an F-measure of 70% when evaluated against a hand-annotated corpus.
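The two-phase structure can be illustrated with a minimal sketch. The generator and filter below are hypothetical stand-ins, not the generators or filters evaluated in the paper: the generator assumes a simple "attribute: value" line pattern, and the filter assumes that an attribute seen on too few pages is improbable. All function names and the `min_pages` threshold are assumptions made for illustration. The last helper only works through the throughput arithmetic implied by the stated figures.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]

# Hypothetical syntactic pattern: a short alphabetic label, a colon, a value.
_PAIR_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def generate_candidates(page: str) -> List[Pair]:
    """Phase 1 (per page, so trivially parallel): annotate likely pairs."""
    pairs = []
    for line in page.splitlines():
        match = _PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def filter_candidates(pages_pairs: Iterable[List[Pair]],
                      min_pages: int = 2) -> List[Pair]:
    """Phase 2 (assumed heuristic): keep pairs whose attribute
    appears on at least `min_pages` distinct pages."""
    pages_pairs = list(pages_pairs)
    support = Counter()
    for pairs in pages_pairs:
        # Count each attribute at most once per page.
        support.update({attr for attr, _ in pairs})
    return [(attr, value)
            for pairs in pages_pairs
            for attr, value in pairs
            if support[attr] >= min_pages]


def pages_per_second_per_machine(pages: int, hours: float,
                                 machines: int) -> float:
    """Average per-machine throughput needed to finish in `hours`."""
    return pages / (hours * 3600 * machines)


if __name__ == "__main__":
    pages = [
        "Price: $10\nWeight: 2 kg\nNote: call me",
        "Price: $12\nWeight: 3 kg",
    ]
    candidates = [generate_candidates(p) for p in pages]
    print(filter_candidates(candidates))
    # 1 billion pages, 6 hours, 1,000 machines: about 46 pages/s each.
    print(round(pages_per_second_per_machine(10**9, 6, 1000), 1))
```

Because candidate generation looks at one page at a time, it maps naturally onto a per-page parallel step, while a filter of this kind needs only aggregated counts across pages. At the stated scale, finishing in under 6 hours requires each machine to sustain somewhat more than 46 pages per second on average.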
