Creating Permanent Test Collections of Web Pages for Information Extraction Research

Research on automatic web information extraction needs permanent, annotated web page collections that enable objective performance evaluation of different algorithms. Researchers currently lack such representative and contemporary test collections, especially for web tables. At the same time, creating sharable web page collections is not trivial, because the dynamic and diverse web technologies in use today often produce short-lived online content. In this paper, we address the problem of creating static representations of web pages in order to build sharable ground-truth test sets. We explain the principal difficulties of the problem, discuss possible approaches, and introduce our solution: WebPageDump, a Firefox extension capable of saving web pages exactly as they are rendered online. Finally, we benchmark our system against current alternatives using an innovative automatic method based on image snapshots.
