Web-Scale Data Integration: You Can Only Afford to Pay as You Go

The World Wide Web is witnessing an increase in the amount of structured content: vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon creates an opportunity for structured data management, dealing with heterogeneity at web scale presents many new challenges. In this paper, we highlight these challenges in two scenarios: the Deep Web and Google Base. We contend that traditional data integration techniques are no longer valid in the face of such heterogeneity and scale. We propose a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as a means of achieving web-scale data integration.
