Big Data Pre-Processing: Closing the Data Quality Enforcement Loop

In the Big Data era, data is a core asset for governmental, institutional, and private organizations. Extracting valuable insights is not possible if the data is of poor quality. Data quality (DQ) is therefore a key element of the Big Data processing phase: at this stage, low-quality data is kept out of the Big Data value chain. This paper addresses data quality rules (DQR) discovery, which takes place after quality evaluation and before Big Data pre-processing. We propose a DQR discovery model that enhances and accurately targets pre-processing activities based on quality requirements. We define a set of pre-processing activities associated with data quality dimensions (DQDs) to automate the DQR generation process. Rule optimization is applied to the validated rules to avoid multi-pass pre-processing and to eliminate duplicate rules. The experiments conducted showed increased quality scores after applying the discovered and optimized DQRs to the data.
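
The rule optimization step can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Rule` fields, the `optimize` function, and the sample attribute and activity names are all hypothetical, chosen only to show the idea: duplicate rules are dropped, and the remaining rules are grouped by attribute so that each attribute could be handled in a single pre-processing pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical data quality rule (field names are assumptions)."""
    attribute: str  # column the rule targets
    dimension: str  # data quality dimension, e.g. "completeness"
    activity: str   # pre-processing activity, e.g. "fill_missing"


def optimize(rules):
    """Drop exact duplicate rules, then group the rest by attribute."""
    # dict.fromkeys keeps the first occurrence of each rule, in order.
    unique = list(dict.fromkeys(rules))
    plan = {}
    for rule in unique:
        plan.setdefault(rule.attribute, []).append(rule)
    return plan


# Illustrative rules only; they are not taken from the paper.
rules = [
    Rule("age", "completeness", "fill_missing"),
    Rule("email", "validity", "drop_invalid"),
    Rule("age", "completeness", "fill_missing"),  # exact duplicate
    Rule("age", "accuracy", "clip_range"),
]
plan = optimize(rules)
```

Here `plan` maps each attribute to its list of distinct rules, so a pre-processing job could apply all activities for one attribute together instead of revisiting the data once per rule.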
