AMBER: turning annotations into knowledge

Web extraction is the task of turning unstructured HTML into knowledge. Computers can already generate annotations of unstructured HTML, but the more important step is turning those annotations into structured knowledge. Unfortunately, current systems that extract knowledge from result pages lack accuracy. In this proposal, we present AMBER, a fully automated system that turns annotations into structured knowledge from any result page of a given domain. AMBER observes basic domain attributes on a page and leverages repeated occurrences of similar attributes to group related attributes into records. This contrasts with previous approaches, which analyze only the repeated structure of the HTML because no domain knowledge is available to them. Our multi-domain experimental evaluation on hundreds of sites demonstrates that AMBER achieves accuracy (>98%) comparable to that of a skilled human annotator.
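
To make the idea of grouping attributes by their repeated occurrences concrete, the following is a minimal sketch and not AMBER's actual algorithm. It assumes that annotations arrive as (attribute type, value) pairs in document order, and it uses a deliberately simple rule: a new record starts whenever an attribute type already seen in the current record appears again. The function name `group_into_records` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: an annotation is an (attribute type, text value) pair,
# and the list is ordered as the attributes appear on the page.
Annotation = Tuple[str, str]


def group_into_records(annotations: List[Annotation]) -> List[Dict[str, str]]:
    """Group annotations into records using repetition as the boundary signal.

    Simplified illustrative rule (not taken from the paper): when an
    attribute type recurs, the current record is closed and a new one begins.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for attr_type, value in annotations:
        if attr_type in current:
            # The same attribute type appears again: treat it as the
            # start of the next record.
            records.append(current)
            current = {}
        current[attr_type] = value
    if current:
        records.append(current)
    return records


# Hypothetical annotations from a result page listing two items.
annotations = [
    ("price", "250000"), ("location", "Oxford"), ("bedrooms", "3"),
    ("price", "310000"), ("location", "Summertown"), ("bedrooms", "4"),
]
records = group_into_records(annotations)
```

A rule this simple breaks down when annotations are noisy or when attributes are missing from some records, which is exactly where a system would also need the page's structure and domain knowledge to decide record boundaries.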
