Sequential Data Cleaning: A Statistical Approach

Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, the small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistical based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.

[1]  Rajeev Rastogi,et al.  A cost-based model and effective heuristic for repairing constraints by value modification , 2005, SIGMOD '05.

[2]  王建民 Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data , 2015 .

[3]  Sunil Prabhakar,et al.  ERACER: a database approach for statistical inference and data cleaning , 2010, SIGMOD Conference.

[4]  Philip S. Yu,et al.  SCREEN: Stream Data Cleaning under Speed Constraints , 2015, SIGMOD Conference.

[5]  James Allan,et al.  A comparison of statistical significance tests for information retrieval evaluation , 2007, CIKM '07.

[6]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[7]  D. Brillinger Time series - data analysis and theory , 1981, Classics in applied mathematics.

[8]  Paolo Papotti,et al.  Holistic data cleaning: Putting violations into context , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[9]  J. Jackson Wiley Series in Probability and Mathematical Statistics , 2004 .

[10]  Divesh Srivastava,et al.  Truth Finding on the Deep Web: Is the Problem Solved? , 2012, Proc. VLDB Endow..

[11]  Avishek Saha,et al.  Sequential Dependencies , 2009, Proc. VLDB Endow..

[12]  Ahmed K. Elmagarmid,et al.  Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes , 2013, SIGMOD '13.

[13]  Chunping Li,et al.  Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data , 2015, KDD.

[14]  Minos N. Garofalakis,et al.  Adaptive cleaning for RFID data streams , 2006, VLDB.

[15]  E. S. Gardner EXPONENTIAL SMOOTHING: THE STATE OF THE ART, PART II , 2006 .

[16]  Tamraparni Dasu,et al.  Statistical Distortion: Consequences of Data Cleaning , 2012, Proc. VLDB Endow..