Mining complex matchings across Web query interfaces

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sourcess. As a new attempt, this paper studies such matching as a data mining problem. Specifically, while complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrences patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach, which consists of dual mining of positive and negative correlations. We evaluate our approach on deep Web sources in several object domains (e.g., Books and Airfares) and the results show that the correlation mining approach does discover semantically meaningful matchings among attributes.

[1]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[2]  Luis Gravano,et al.  Probe, count, and classify: categorizing hidden web databases , 2001, SIGMOD '01.

[3]  Kevin Chen-Chuan Chang,et al.  Statistical schema matching across web query interfaces , 2003, SIGMOD '03.

[4]  Edward Omiecinski,et al.  Alternative Interest Measures for Mining Associations in Databases , 2003, IEEE Trans. Knowl. Data Eng..

[5]  B. Huberman,et al.  The Deep Web : Surfacing Hidden Value , 2000 .

[6]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[7]  Kevin Chen-Chuan Chang,et al.  Understanding Web query interfaces: best-effort parsing with hidden syntax , 2004, SIGMOD '04.

[8]  Tao Tao,et al.  Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach , 2004, EDBT Workshops.

[9]  Erhard Rahm,et al.  Generic Schema Matching with Cupid , 2001, VLDB.

[10]  Jiawei Han,et al.  CoMine: efficient mining of correlated patterns , 2003, Third IEEE International Conference on Data Mining.

[11]  Laura M. Haas,et al.  Data-driven understanding and refinement of schema mappings , 2001, SIGMOD '01.

[12]  Leonard J. Seligman,et al.  Bulletin of the Technical Committee on Data Engineering September 2002 , 2002 .

[13]  Oren Etzioni,et al.  Crossing the Structure Chasm , 2003, CIDR.

[14]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[15]  Jaideep Srivastava,et al.  Selecting the right interestingness measure for association patterns , 2002, KDD.

[16]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[17]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[18]  Mitesh Patel,et al.  Structured databases on the web: observations and implications , 2004, SGMD.