Big data quality framework: a holistic approach to continuous quality management

Big Data is an essential research area for governments, institutions, and private agencies that rely on analytics to support their decisions. It covers how data is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may have unpredictable consequences, and it undermines confidence in both the data and its source. In the Big Data context, data characteristics such as volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and call for efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a costly and time-consuming process, since it requires extensive computing resources. Maintaining quality throughout the Big Data lifecycle requires quality profiling and verification before any processing decision is made. This paper proposes a BDQ Management Framework that enhances pre-processing activities while strengthening data control. The framework is built on a new concept, the Big Data Quality Profile, which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the framework's Big Data profiling and sampling components, a faster and more efficient data quality estimation is performed before and after an intermediate pre-processing phase. The exploratory profiling component performs the initial quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions. It generates quality rules by applying various pre-processing activities and their related functions. These rules primarily target the Data Quality Profile and produce quality scores for the selected quality attributes. The paper discusses the framework implementation and dataflow management across the various quality management processes, and concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.
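
The idea of estimating quality scores on a sample rather than on the full dataset can be illustrated with a minimal Python sketch. It is not the framework's implementation: the names `QualityProfile`, `completeness`, and `estimate_profile` are hypothetical, completeness is used as one example quality dimension, and simple random sampling stands in for whatever sampling strategy the framework actually applies.

```python
import random
from dataclasses import dataclass, field


@dataclass
class QualityProfile:
    """Hypothetical stand-in for a quality profile: requirements plus scores."""
    required: dict                      # attribute -> minimum acceptable score
    scores: dict = field(default_factory=dict)

    def failing(self):
        """Attributes whose estimated score is below the requirement."""
        return sorted(a for a, threshold in self.required.items()
                      if self.scores.get(a, 0.0) < threshold)


def completeness(rows, attribute):
    """Fraction of rows holding a non-missing value for the attribute."""
    if not rows:
        return 0.0
    present = sum(1 for r in rows if r.get(attribute) not in (None, ""))
    return present / len(rows)


def estimate_profile(rows, profile, sample_size, seed=0):
    """Score each required attribute on a simple random sample of the rows."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    profile.scores = {a: completeness(sample, a) for a in profile.required}
    return profile


# Illustrative data (assumed): every fourth record is missing "age".
rows = [{"id": i, "age": None if i % 4 == 0 else 30} for i in range(1000)]
profile = estimate_profile(rows, QualityProfile({"id": 1.0, "age": 0.9}), 200)
print(profile.scores, profile.failing())
```

Attributes reported by `failing()` would be the candidates for pre-processing; re-running `estimate_profile` afterwards gives the "before and after" comparison the abstract describes.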
