Object Matching for Information Integration: A Profiler-Based Approach

Object matching is a fundamental problem that arises in numerous information integration scenarios. Virtually all existing solutions to this problem have assumed that the objects to be matched share the same set of attributes, and that they can be matched by comparing the similarities of the attributes. We consider the more general problem where the objects can also have disjoint attributes, such as matching tuples that come from relational tables with schemas (age,name) and (name,salary), respectively. We describe PROM, a solution that also exploits the disjoint attributes to improve matching accuracy. In the above example, PROM begins by matching any two given tuples based on the shared attribute name. Then it applies a set of profilers, each of which contains some knowledge about what constitutes a typical person. The profilers examine the tuple pair to see if it can plausibly make up a person. For example, a profiler may state that because the age is 9 and the salary is 200K, the tuples do not make up a person and thus do not match. Profilers can be manually specified by domain experts, learned from training data, transferred from other matching tasks, or constructed from external data. Thus, the PROM approach is distinguished in that it not only can exploit disjoint attributes to improve matching accuracy, but can also reuse knowledge from previous object matching tasks.
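The two-stage flow described above (match on the shared attribute, then let profilers veto implausible pairs) can be sketched roughly as follows. This is only an illustrative sketch, not PROM's actual implementation: the function names, the string-similarity measure, the 0.9 threshold, the age and salary cutoffs, and the hard boolean combination of profilers are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Record = Dict[str, object]
# A profiler returns True when the merged record looks like a plausible person.
Profiler = Callable[[Record], bool]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def plausible_age_salary(person: Record) -> bool:
    """Hypothetical hand-written profiler: a young child with a very
    high salary is not a typical person. Cutoffs are illustrative."""
    age = person.get("age")
    salary = person.get("salary")
    if age is None or salary is None:
        return True  # nothing to check, so do not veto
    return not (age < 16 and salary > 100_000)


def match(t1: Record, t2: Record, profilers: List[Profiler],
          threshold: float = 0.9) -> bool:
    """Stage 1: compare the shared attribute `name`.
    Stage 2: merge the tuples and let every profiler check the result."""
    if name_similarity(str(t1["name"]), str(t2["name"])) < threshold:
        return False
    merged = {**t1, **t2}  # combines the disjoint attributes of both tuples
    return all(profiler(merged) for profiler in profilers)


profilers = [plausible_age_salary]
child = {"age": 9, "name": "Mike Smith"}
adult = {"age": 45, "name": "Mike Smith"}
earner = {"name": "Mike Smith", "salary": 200_000}

child_matches = match(child, earner, profilers)  # vetoed by the profiler
adult_matches = match(adult, earner, profilers)  # passes both stages
```

In this sketch a single failing profiler rejects the pair. A profiler learned from training data or built from external data would more likely return a confidence score, and the scores would then need to be combined rather than treated as hard vetoes.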
