Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation

Abstract: Low-rank matrix completion (LRMC) is a recently emerging technique that has achieved promising performance in many real-world applications, such as traffic data imputation. To estimate missing values, current LRMC-based methods minimize the rank of a single matrix comprising all of the traffic data, implicitly assuming that all traffic data are equally important. As a result, they emphasize what the traffic data have in common while ignoring the subtle but crucial differences that arise from the different locations of loop detectors and the different sampling dates. To handle this problem and further improve imputation performance, this paper proposes a novel correlation-based LRMC method. First, LRMC is applied to obtain initial estimates of the missing values. Then, a distance matrix containing the pairwise distances between samples is built from a weighted Pearson's correlation that strikes a balance between observed values and imputed values. For each sample, its most similar samples according to this distance matrix are selected by an adaptive K-nearest neighbor (KNN) search. LRMC is then applied to these more strongly correlated samples to obtain refined estimates of the missing values. Finally, a simple but effective ensemble learning strategy integrates the multiple imputed values obtained for each sample to further improve imputation performance. Extensive numerical experiments are performed on both traffic flow volume data and standard benchmark datasets. The results confirm that the proposed correlation-based LRMC and its ensemble learning version achieve better imputation performance than competing methods.
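The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the LRMC solver is replaced by a simple iterative soft-thresholded SVD, the weighted Pearson's correlation gives weight 1 to entries observed in both samples and a hypothetical weight `alpha` to all others, the adaptive KNN search is replaced by a fixed `k`, and the ensemble step is a plain average over every neighborhood in which a sample appears. All function and parameter names (`lrmc`, `weighted_pearson_distance`, `correlation_lrmc_ensemble`, `tau`, `alpha`, `k`) are hypothetical.

```python
import numpy as np

def lrmc(X, mask, tau=1.0, n_iter=100):
    """Stand-in LRMC solver (assumption): iterative soft-thresholded SVD."""
    Z = np.where(mask, X, 0.0)  # start with zeros in the missing entries
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U * np.maximum(s - tau, 0.0)) @ Vt  # shrink singular values
        Z = np.where(mask, X, low_rank)  # keep observed entries fixed
    return Z

def weighted_pearson_distance(Z, mask, alpha=0.5):
    """Pairwise distance 1 - r, with r a weighted Pearson correlation of rows.

    Assumed weighting: 1 where both rows are observed, alpha otherwise.
    """
    n = Z.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w = np.where(mask[i] & mask[j], 1.0, alpha)
            w = w / w.sum()
            di = Z[i] - np.sum(w * Z[i])  # center with the weighted mean
            dj = Z[j] - np.sum(w * Z[j])
            denom = np.sqrt(np.sum(w * di**2) * np.sum(w * dj**2))
            r = np.sum(w * di * dj) / denom if denom > 0 else 0.0
            D[i, j] = D[j, i] = 1.0 - r
    return D

def correlation_lrmc_ensemble(X, mask, k=5, tau=1.0, alpha=0.5):
    """Initial LRMC, neighbor search, local LRMC, then ensemble averaging."""
    Z0 = lrmc(X, mask, tau)                        # initial estimates
    D = weighted_pearson_distance(Z0, mask, alpha)
    total = np.zeros_like(Z0)
    count = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        order = np.argsort(D[i])
        neighbours = order[order != i][:k]         # fixed k, not adaptive
        idx = np.concatenate(([i], neighbours))
        Zi = lrmc(X[idx], mask[idx], tau)          # LRMC on correlated samples
        total[idx] += Zi                           # collect every estimate
        count[idx] += 1
    refined = total / count[:, None]               # average the estimates
    return np.where(mask, X, refined)
```

Here `X` is a samples-by-features array (for example, one row per detector and day) and `mask` is a boolean array that is `True` where a value was observed. Missing entries of `X` may hold any placeholder, including `NaN`, because they are never read. Averaging is only one possible way to combine the multiple estimates; the abstract does not specify the combination rule.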
