Paper Information - Uncovering the Relational Web

Uncovering the Relational Web

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web's HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google's general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl's distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.

[1] Peter D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL, 2001, ECML.

[2] Hsin-Hsi Chen, et al. Mining Tables from Large Scale HTML Texts, 2000, COLING.

[3] Kevin Chen-Chuan Chang, et al. Knocking the door to the deep Web: integrating Web query interfaces, 2004, SIGMOD '04.

[4] Jianying Hu, et al. Flexible Web document analysis for delivery to narrow-bandwidth devices, 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[5] Daisy Zhe Wang, et al. WebTables: exploring the power of tables on the web, 2008, Proc. VLDB Endow.

[6] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.

[7] J. Cordy, et al. A Survey of Table Recognition: Models, Observations, Transformations, and Inferences, 2003.

[8] Doug Downey, et al. Web-scale information extraction in knowitall: (preliminary results), 2004, WWW '04.

[9] Jayant Madhavan, et al. Structured Data Meets the Web: A Few Observations, 2006, IEEE Data Eng. Bull.

[10] Luis Gravano, et al. Snowball: a prototype system for extracting relations from large text collections, 2001, SIGMOD '01.

[11] Yalin Wang, et al. A machine learning based approach for table detection on the web, 2002, WWW '02.

[12] Wolfgang Gatterbauer, et al. Towards domain-independent information extraction from web tables, 2007, WWW '07.

[13] AnHai Doan, et al. Corpus-based schema matching, 2005, 21st International Conference on Data Engineering (ICDE'05).