A Survey on Data Cleaning Methods in Cyberspace

Cyberspace lets users and information and communication systems interact with one another to conduct business. Data, as the carrier of information, represents the content processed in different business tasks. Data cleaning plays an important role in improving data quality in various cyberspace scenarios, such as RFID and sensor data and ETL processes. This paper presents a survey of state-of-the-art data cleaning methods in cyberspace. Based on the characteristics of data cleaning, we extract the relevant key elements of cyberspace and use them to classify existing work. After elaborating on and analyzing each category, we summarize the core technologies of data cleaning, such as data quality rules, models, and crowdsourcing, together with the challenges they face. Furthermore, we give suggestions for future research on data cleaning in cyberspace from both technological and interactive perspectives.
