Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web

The Web has been rapidly “deepened” by myriad searchable databases online, where data are hidden behind query interfaces. Toward large scale integration over this “deep Web,” we have been building the MetaQuerier system, for both exploring (to find) and integrating (to query) databases on the Web. As an interim report, first, this paper proposes our goal of the MetaQuerier for Web-scale integration: with its dynamic and ad-hoc nature, such large scale integration mandates both dynamic source discovery and on-the-fly query translation. Second, we present the system architecture and underlying technology of key subsystems in our ongoing implementation. Third, we discuss “lessons” learned to date, focusing on our efforts in system integration, for putting individual subsystems to function together. On one hand, we observe that, across subsystems, the system integration of an integration system is itself non-trivial, which presents both challenges and opportunities beyond subsystems in isolation. On the other hand, we also observe that, across subsystems, there emerge unified insights of “holistic integration,” which leverage large scale itself as a unique opportunity for information integration.
