NADEEF: a commodity data cleaning system

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows the users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it through writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well known CFDs (FDs), MDs and ETL rules. Treating user implemented interfaces as black-boxes, the core provides algorithms to detect errors and to clean data. The core is designed in a way to allow cleaning algorithms to cope with multiple rules holistically, i.e. detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations for core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.

[1]  Hector Garcia-Molina,et al.  Generic entity resolution with negative rules , 2009, The VLDB Journal.

[2]  Gabriel M. Kuper,et al.  Aggregation in Constraint Databases , 1993, PPCP.

[3]  Sunil Prabhakar,et al.  ERACER: a database approach for statistical inference and data cleaning , 2010, SIGMOD Conference.

[4]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[5]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques , 2006, Data-Centric Systems and Applications.

[6]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[7]  Jef Wijsen,et al.  Database repairing using updates , 2005, TODS.

[8]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.

[9]  Jianzhong Li,et al.  Towards certain fixes with editing rules and master data , 2010, The VLDB Journal.

[10]  Jan Chomicki,et al.  Consistent query answers in inconsistent databases , 1999, PODS '99.

[11]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[12]  Toby Walsh,et al.  Handbook of Satisfiability: Volume 185 Frontiers in Artificial Intelligence and Applications , 2009 .

[13]  Ahmed K. Elmagarmid,et al.  Guided data repair , 2011, Proc. VLDB Endow..

[14]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications) , 2006 .

[15]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[16]  Shuai Ma,et al.  Improving Data Quality: Consistency and Accuracy , 2007, VLDB.

[17]  Shuai Ma,et al.  Interaction between Record Matching and Data Repairing , 2014, JDIQ.

[18]  Sharad Malik,et al.  Zchaff2004: An Efficient SAT Solver , 2004, SAT (Selected Papers.

[19]  Laks V. S. Lakshmanan,et al.  On approximating optimum repairs for functional dependency violations , 2009, ICDT '09.

[20]  Claudia Niederée,et al.  Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data , 2012, WSDM '12.

[21]  Wenfei Fan,et al.  Dependencies revisited for improving data quality , 2008, PODS.

[22]  Jan Chomicki,et al.  Answer sets for consistent query answering in inconsistent databases , 2002, Theory and Practice of Logic Programming.

[23]  D. Holt,et al.  A Systematic Approach to Automatic Edit and Imputation , 1976 .

[24]  Shuai Ma,et al.  Increasing the Expressivity of Conditional Functional Dependencies without Extra Complexity , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[25]  Rajeev Rastogi,et al.  A cost-based model and effective heuristic for repairing constraints by value modification , 2005, SIGMOD '05.

[26]  Jianzhong Li,et al.  Reasoning about Record Matching Rules , 2009, Proc. VLDB Endow..

[27]  Felix Naumann,et al.  Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies , 2006, IEEE Data Eng. Bull..

[28]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[29]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[30]  Wenfei Fan,et al.  Conditional functional dependencies for capturing data inconsistencies , 2008, TODS.