Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems

A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema and initial mappings from source schemas to the mediated schema), and then refining the system as it is used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention, based only on the names of the attributes in the schemas. Our clustering approach uses a probabilistic model to handle uncertainty in assigning schemas to domains. We also propose a query classifier that determines, for a given keyword query, the domains most relevant to that query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.
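To make the three ideas in the abstract concrete (clustering on attribute names, probabilistic assignment of schemas to domains, and ranking domains for a keyword query), the following is a minimal illustrative sketch. It is not the paper's algorithm: the Jaccard similarity on name tokens, the greedy threshold clustering, the normalised-similarity membership, and all function names and sample schemas are assumptions chosen only for illustration.

```python
def tokens(schema):
    """Lower-cased word tokens drawn from a schema's attribute names."""
    return {t for name in schema for t in name.lower().replace("_", " ").split()}

def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(schemas, threshold=0.25):
    """Greedy pass: join the most similar existing cluster, else start a new one."""
    clusters = []  # each cluster is a list of schema indices
    for i, s in enumerate(schemas):
        best, best_sim = None, threshold
        for c in clusters:
            sim = max(jaccard(tokens(s), tokens(schemas[j])) for j in c)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def membership(schema, clusters, schemas):
    """Soft assignment: average similarity per cluster, normalised to sum to 1."""
    sims = [sum(jaccard(tokens(schema), tokens(schemas[j])) for j in c) / len(c)
            for c in clusters]
    total = sum(sims)
    return [s / total if total else 1 / len(sims) for s in sims]

def classify_query(query, clusters, schemas):
    """Rank cluster indices by the fraction of query keywords in each cluster's vocabulary."""
    q = set(query.lower().split())
    scores = []
    for c in clusters:
        vocab = set().union(*(tokens(schemas[j]) for j in c))
        scores.append(len(q & vocab) / len(q) if q else 0.0)
    return sorted(range(len(clusters)), key=lambda k: -scores[k])

# Hypothetical sample schemas: two about books, two about cars.
schemas = [
    ["title", "author", "publisher", "isbn"],
    ["book_title", "author", "price"],
    ["make", "model", "year", "mileage"],
    ["car_make", "model", "price", "year"],
]
clusters = cluster(schemas)
probs = membership(schemas[3], clusters, schemas)
ranking = classify_query("model year", clusters, schemas)
print(clusters, probs, ranking)
```

In this toy setting the fourth schema shares the attribute name "price" with a book schema, so its membership is split across both clusters rather than assigned outright, which is the kind of uncertainty a probabilistic assignment is meant to capture.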
