Trends in Cleaning Relational Data: Consistency and Deduplication

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012. To detect data errors, data quality rules or integrity constraints (ICs) have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous and is referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms with these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in "big data" cleaning.
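To make the notion of a violation concrete, the following is a minimal sketch of detecting violations of a single functional dependency, one common kind of integrity constraint. The table, the attribute names (`zip`, `city`), the rule `zip -> city`, and the helper `fd_violations` are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Each returned group is a violation of the functional dependency
    lhs -> rhs. This is an illustrative helper, not a method from the paper.
    """
    # Group rows by their values on the left-hand-side attributes.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)

    # A group violates the dependency if it holds more than one
    # distinct combination of right-hand-side values.
    violations = []
    for members in groups.values():
        distinct_rhs = {tuple(r[a] for a in rhs) for r in members}
        if len(distinct_rhs) > 1:
            violations.append(members)
    return violations


# Hypothetical data: rows 1 and 2 share a zip code but name different cities.
rows = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "NYC"},
    {"id": 3, "zip": "60601", "city": "Chicago"},
]

violating = fd_violations(rows, lhs=["zip"], rhs=["city"])
for group in violating:
    print([r["id"] for r in group])
```

The sketch only detects violations. Deciding which of the conflicting values to keep is the repair step, which can be automated or handed to a human expert or knowledge base as described above.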
