Big Data Cleaning

Data cleaning is a lively subject that has played an important part in the history of data management and data analytics, and it is still undergoing rapid development. Moreover, data cleaning is considered a main challenge in the era of big data, due to the increasing volume, velocity, and variety of data in many applications. This paper provides an overview of recent work on different aspects of data cleaning: error detection methods, data repairing algorithms, and a generalized data cleaning system. It also discusses our efforts on data cleaning methods from the perspective of big data, in terms of volume, velocity, and variety.
