Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework

Data cleaning is unavoidable when integrating data from distributed operational databases, because no unified set of standards spans all the sources. One of its most challenging phases is removing fuzzy duplicate records. Approximate, or fuzzy, duplicates are two or more tuples that describe the same real-world entity with different syntax: the semantics are the same, but the representations differ. Eliminating fuzzy duplicates is useful in any database but is critical in data-integration and analytical-processing domains, which involve data warehouses, data mining applications, and decision support systems. Earlier approaches required hard-coding rules against a specific schema, which was time consuming and tedious, and the rules could not be adapted afterward. We propose a novel duplicate-elimination framework that exploits fuzzy inference and includes unique machine learning capabilities, letting users clean their data flexibly, with little effort, and without any coding.
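To make the notion of a fuzzy duplicate concrete, the following minimal sketch flags record pairs whose fields are syntactically different but nearly identical after normalization. It is not the proposed framework: the crisp `threshold`, the token-sorting `normalize` helper, the field names, and the sample records are all assumptions made for illustration, and the string similarity from Python's standard `difflib` stands in for the fuzzy inference the framework actually uses.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(sorted(tokens))


def record_similarity(a: dict, b: dict) -> float:
    """Average string similarity over the fields the two records share."""
    shared = sorted(k for k in a if k in b)
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, normalize(a[k]), normalize(b[k])).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def fuzzy_duplicates(records: list, threshold: float = 0.8) -> list:
    """Return index pairs whose similarity meets the (assumed) crisp threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if record_similarity(a, b) >= threshold
    ]


# Hypothetical sample data: records 0 and 1 describe the same person
# with different syntax; record 2 is an unrelated entity.
records = [
    {"name": "John A. Smith", "city": "New York"},
    {"name": "Smith, John A", "city": "new york"},
    {"name": "Maria Garcia", "city": "Lisbon"},
]

print(fuzzy_duplicates(records))
```

A fixed threshold like this is exactly the kind of hand-tuned rule that is hard to adapt across schemas, which is the limitation the abstract attributes to earlier approaches.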
