A Review on Data Cleansing Methods for Big Data

Abstract Organizations now have access to massive amounts of data that influence their business decisions. Data collected from various sources are often dirty, which reduces the accuracy of prediction results. Data cleansing improves data quality and helps organizations ensure that their data are ready for the analysis phase. However, the amount of data that organizations collect increases every year, and most existing cleansing methods are no longer suitable for big data. The data cleansing process mainly consists of identifying the errors, detecting them, and correcting them. Although the data need to be analyzed quickly, the cleansing process is complex and time-consuming because it must ensure that the cleansed data are of better quality. Domain experts play an undeniable role in this process, since verification and validation are the main concerns for the cleansed data. This paper reviews the data cleansing process, the challenges of data cleansing for big data, and the available data cleansing methods.
