Mind the gap

Recording sensor data is seldom a perfect process. Failures in power, communication or storage can leave occasional blocks of data missing, not only affecting real-time monitoring but also compromising the quality of near- and off-line data analysis. Several recovery (imputation) algorithms have been proposed to replace missing blocks. Unfortunately, little is known about their relative performance, as existing comparisons are limited to a small subset of relevant algorithms, to very few datasets, or, often, to both. Drawing general conclusions in this case remains a challenge. In this paper, we empirically compare twelve recovery algorithms using a novel benchmark. All but two of the algorithms were re-implemented in a uniform test environment. The benchmark gathers ten different datasets, which collectively represent a broad range of applications. Our benchmark allows us to fairly evaluate the strengths and weaknesses of each approach, and to recommend the best technique on a use-case basis. It also allows us to identify the limitations of the current body of algorithms and suggest future research directions.

PVLDB Reference Format: Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, and Philippe Cudré-Mauroux. Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. PVLDB, 13(5): 768-782, 2020. DOI: https://doi.org/10.14778/3377369.3377383
