Automatic web-scale information extraction

In this demonstration, we showcase the technologies that we are building at Yahoo! for Web-scale Information Extraction. Given any new Website, containing semi-structured information about a pre-specified set of schemas, we show how to populate objects in the corresponding schema by automatically extracting information from the Website.

[1]  Tobias Anton XPath-Wrapper Induction by generating tree traversal patterns , 2005, LWA.

[2]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[3]  Ravi Kumar,et al.  A web of concepts , 2009, PODS.

[4]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[5]  Raghu Ramakrishnan,et al.  Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach , 2007, VLDB.

[6]  Jussi Myllymaki,et al.  Robust Web Data Extraction with XML Path Expressions , 2002 .

[7]  Pierre Senellart,et al.  Automatic wrapper induction from hidden-web sources with domain knowledge , 2008, WIDM '08.

[8]  Calton Pu,et al.  Wrapping web data into XML , 2001, SGMD.

[9]  Jayant Madhavan,et al.  Harvesting relational tables from lists on the web , 2009, The VLDB Journal.

[10]  Arnaud Sahuguet,et al.  Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F , 1999, VLDB.

[11]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[12]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[13]  Lorenzo Blanco,et al.  Highly efficient algorithms for structural clustering of large websites , 2011, WWW.

[14]  Ravi Kumar,et al.  Automatic Wrappers for Large Scale Web Extraction , 2011, Proc. VLDB Endow..

[15]  Rahul Gupta,et al.  Answering Table Augmentation Queries from Unstructured Lists on the Web , 2009, Proc. VLDB Endow..

[16]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.