Management of Inconsistencies in Data Integration

Data integration aims to provide a unified view over data coming from various sources. One of its most challenging tasks is handling the inconsistencies that appear in the integrated data efficiently and effectively. In this chapter, we survey techniques for handling inconsistencies in data integration, focusing on two groups. The first group contains techniques for computing consistent query answers and includes mechanisms for the compact representation of repairs, query rewriting, and logic programs. The second group contains techniques for resolving inconsistencies. These include methodologies for computing similarity between atomic values as well as between groups of data, collective techniques, scaling to large datasets, and dealing with the uncertainty related to inconsistencies.
