Establishing value mappings using statistical models and user feedback

In this paper, we present a "value mapping" algorithm that relies on neither syntactic similarity nor semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the distinguishing characteristics of values and their co-occurrence. It then finds matching values by computing the distances between the models, iteratively refining the models with user feedback. Our experimental results suggest that the approach establishes value mappings even in the presence of opaque data values and can thus be a useful addition to existing data integration techniques.
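
The pipeline described above can be illustrated with a minimal sketch. It is not the paper's actual algorithm: it assumes that each value is summarized by a two-component profile (its relative frequency and the entropy of the values it co-occurs with in a second column), that profiles are compared by Euclidean distance, and that matching is greedy and one-to-one. The names `value_profiles`, `match_values`, and `confirmed` are hypothetical, as is the toy data. User feedback is modeled simply as a set of confirmed pairs that are held fixed while the remaining values are re-matched.

```python
import math
from collections import Counter, defaultdict


def value_profiles(rows):
    """Summarize each value by (relative frequency, co-occurrence entropy).

    rows is a list of (value, context) pairs, where context is the value of
    another column in the same record. Only counts are used, so the labels
    themselves may be opaque. (Illustrative profile, assumed for this sketch.)
    """
    total = len(rows)
    contexts = defaultdict(Counter)
    for value, context in rows:
        contexts[value][context] += 1

    profiles = {}
    for value, counter in contexts.items():
        n = sum(counter.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counter.values())
        profiles[value] = (n / total, entropy)
    return profiles


def match_values(source, target, confirmed=None):
    """Greedy one-to-one matching of profiles by Euclidean distance.

    confirmed holds user-approved pairs (source value -> target value);
    they are kept as-is and excluded from the automatic matching.
    """
    mapping = dict(confirmed or {})
    free_targets = set(target) - set(mapping.values())
    candidates = sorted(
        (math.dist(source[s], target[t]), s, t)
        for s in source if s not in mapping
        for t in free_targets
    )
    for _, s, t in candidates:
        if s not in mapping and t in free_targets:
            mapping[s] = t
            free_targets.remove(t)
    return mapping


# Toy data: two tables that encode the same attribute with unrelated labels.
source_rows = [("red", "p")] * 6 + [("green", "p"), ("green", "q"), ("green", "q"), ("blue", "q")]
target_rows = [("c1", "u")] * 5 + [("c2", "u"), ("c2", "v"), ("c2", "v")] + [("c3", "v")] * 2

source_profiles = value_profiles(source_rows)
target_profiles = value_profiles(target_rows)

first_pass = match_values(source_profiles, target_profiles)
# One round of (hypothetical) user feedback: the user pins "red" to "c3".
second_pass = match_values(source_profiles, target_profiles, confirmed={"red": "c3"})
```

In this sketch the feedback only constrains the matching; a fuller treatment would also use the confirmed pairs to recompute the profiles themselves before the next iteration.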
