Big Data Pre-processing: A Quality Framework

With the abundance of raw data generated from various sources, Big Data has become a preeminent approach to acquiring, processing, and analyzing large amounts of heterogeneous data to derive valuable evidence. The size, speed, and formats in which data is generated and processed affect the overall quality of information. Therefore, Quality of Big Data (QBD) has become an important factor in ensuring that the quality of data is maintained at all Big Data processing phases. This paper addresses QBD at the pre-processing phase, which includes sub-processes such as cleansing, integration, filtering, and normalization. We propose a QBD model that incorporates processes to support data quality profile selection and adaptation. In addition, the model tracks and registers in a data provenance repository the effect of every data transformation that occurs in the pre-processing phase. We evaluate the data quality selection module using a large EEG dataset. The obtained results illustrate the importance of addressing QBD at an early phase of the Big Data processing lifecycle, since doing so significantly reduces costs and enables more accurate data analysis.
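The idea of registering the effect of each pre-processing transformation can be illustrated with a minimal sketch. It is not the paper's model: the step functions, the `ProvenanceEntry` record, the in-memory list standing in for a provenance repository, and the use of completeness as the only quality dimension are all assumptions made for illustration, and the integration step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Value = Optional[float]  # None stands for a missing sample


@dataclass
class ProvenanceEntry:
    """Hypothetical record of what one transformation did to the data."""
    step: str
    rows_in: int
    rows_out: int
    completeness_in: float
    completeness_out: float


def completeness(values: List[Value]) -> float:
    """Fraction of non-missing values (1.0 for an empty list)."""
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def cleanse(values: List[Value]) -> List[Value]:
    """Drop missing samples."""
    return [v for v in values if v is not None]


def filter_range(values: List[Value], low: float = -100.0,
                 high: float = 100.0) -> List[Value]:
    """Drop samples outside an assumed plausible range [low, high]."""
    return [v for v in values if v is None or low <= v <= high]


def normalize(values: List[Value]) -> List[Value]:
    """Min-max scale the present samples to [0, 1]."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span)
            for v in values]


Step = Tuple[str, Callable[[List[Value]], List[Value]]]


def run_pipeline(values: List[Value],
                 steps: List[Step]) -> Tuple[List[Value], List[ProvenanceEntry]]:
    """Apply each step in order and log its effect on size and completeness."""
    log: List[ProvenanceEntry] = []
    current = list(values)
    for name, fn in steps:
        result = fn(current)
        log.append(ProvenanceEntry(name, len(current), len(result),
                                   completeness(current), completeness(result)))
        current = result
    return current, log


raw = [1.0, None, 3.0, 500.0, 5.0]
steps = [("cleansing", cleanse), ("filtering", filter_range),
         ("normalization", normalize)]
cleaned, provenance = run_pipeline(raw, steps)
for entry in provenance:
    print(entry)
```

Each log entry shows how a step changed the record count and the completeness score, which is the kind of information a quality profile could use to decide whether a step is worth applying.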
