State of the Art on the Quality of Big Data: A Systematic Literature Review and Classification Framework

One of the most significant challenges of Big Data is extracting knowledge from huge amounts of data. The usefulness of the extracted information depends strongly on data quality. Despite its importance, data quality has only recently been taken into consideration by the big data community, and no comprehensive review has been conducted in this area. Therefore, the purpose of this study is to review and present the state of the art in big data quality research through a hierarchical framework. The dimensions of the proposed framework cover various aspects of quality assessment of Big Data, including 1) the processing type of big data, i.e., stream, batch, or hybrid, 2) the main task, and 3) the method used to conduct the task. We compare and critically review all of the studies reported during the last ten years through our proposed framework to identify which of the available data quality assessment methods have been successfully adopted by the big data community. Finally, we provide a critical discussion of the limitations of existing methods and suggest potentially valuable directions for future research in this domain.
