Uncovering the Relational Web

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web's HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google's general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl's distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
