SparkDQ: Efficient generic big data quality management on distributed data-parallel computation

Abstract: In the big data era, large amounts of data are being generated and accumulated across industries. However, data quality issues often hinder users from extracting value from big data, and these issues are therefore drawing increasing attention from data quality management analysts. Cutting-edge solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, which makes it difficult for them to cope with complicated situations on big data. These problems motivate us to build SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. With these interfaces, users can quickly build custom data quality computing tasks for various needs. In addition, SparkDQ implements a set of optimized parallel algorithms that target various data quality goals. We also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-thread algorithms. Compared with the cutting-edge distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is only around half that of Apache Griffin on average. SparkDQ also achieves near-linear data and node scalability.
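
To make the idea of running several detection tasks over one shared data state concrete, the following is a minimal sketch in plain Python. It is not SparkDQ's API: the record layout, the function names (`partition`, `local_state`, `merge`, `completeness`, `uniqueness`), and the metric definitions are all assumptions chosen for illustration. Each partition is summarized once, the summaries are merged with an associative function, and both metrics are then read from the merged state without scanning the data again.

```python
from collections import Counter
from functools import reduce

# Hypothetical records; None marks a missing value.
ROWS = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": None},
    {"id": 2, "city": "Rome"},
    {"id": 4, "city": "Oslo"},
]

def partition(rows, n):
    """Split rows into n chunks, standing in for distributed data partitions."""
    return [rows[i::n] for i in range(n)]

def local_state(rows):
    """Summarize one partition; this summary is the state shared by all tasks."""
    return {
        "count": len(rows),
        "missing_city": sum(1 for r in rows if r["city"] is None),
        "id_counts": Counter(r["id"] for r in rows),
    }

def merge(a, b):
    """Combine two partition summaries; associative, so it can run as a reduce."""
    return {
        "count": a["count"] + b["count"],
        "missing_city": a["missing_city"] + b["missing_city"],
        "id_counts": a["id_counts"] + b["id_counts"],
    }

def completeness(state):
    """Assumed definition: fraction of rows with a non-missing city."""
    return 1 - state["missing_city"] / state["count"]

def uniqueness(state):
    """Assumed definition: fraction of rows whose id occurs exactly once."""
    unique = sum(1 for c in state["id_counts"].values() if c == 1)
    return unique / state["count"]

# One pass builds the merged state; both detection tasks reuse it.
state = reduce(merge, map(local_state, partition(ROWS, 2)))
report = {"completeness": completeness(state), "uniqueness": uniqueness(state)}
print(report)
```

In a distributed setting the same summarize-then-merge pattern would map onto per-partition aggregation followed by a reduce, which is why the merge function is kept associative here.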
