Structure Feature Learning Method for Incomplete Data

Learning with incomplete data remains challenging in many real-world applications, especially when the data are high-dimensional and dynamic. Many imputation-based algorithms have been proposed to handle incomplete data; they use statistics of the historical information to fill in the missing parts. However, these methods make little use of the structural information in the data, which is very helpful for sharing information between the complete entries and the missing ones. For example, in a traffic system, the data exhibit group structure and temporal smoothness. In this paper, we incorporate this structural information and develop a structural feature learning method for learning with incomplete data (SFLIC). The SFLIC model adopts a fused-Lasso-based regularizer and a group-Lasso-style regularizer to increase data sharing at both the temporal smoothness level and the feature group level, filling the gaps where data entries are missing. The objective of the proposed SFLIC model is nonsmooth in the model parameters, so we adopt the smoothing proximal gradient (SPG) method to solve it efficiently. We evaluate our model on both synthetic and real-world highway traffic datasets. Experimental results show that our method outperforms the state-of-the-art methods.
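
To make the two regularizers concrete, the following minimal sketch computes a fused-Lasso-style penalty and a group-Lasso-style penalty on a small parameter matrix. It is an illustration only, not the SFLIC objective itself: the abstract does not give the exact formulation, so the layout of `W` (rows as time steps, columns as features), the grouping in `groups`, and the function names are all assumptions.

```python
import numpy as np

def fused_lasso_penalty(W):
    """Sum of absolute differences between consecutive rows (assumed time steps)."""
    return np.abs(np.diff(W, axis=0)).sum()

def group_lasso_penalty(W, groups):
    """Sum of Euclidean (Frobenius) norms of the column blocks listed in groups."""
    return sum(np.linalg.norm(W[:, g]) for g in groups)

# Hypothetical parameters: 3 time steps (rows) by 3 features (columns).
W = np.array([[1.0, 2.0, 0.0],
              [1.0, 4.0, 0.0],
              [3.0, 4.0, 0.0]])

# Hypothetical feature groups, given as lists of column indices.
groups = [[0, 1], [2]]

fused = fused_lasso_penalty(W)          # small when parameters change slowly over time
group = group_lasso_penalty(W, groups)  # an all-zero group contributes nothing

print(fused, group)
```

The fused term penalizes changes between neighbouring time steps, which encourages temporal smoothness, while the group term penalizes each feature group as a unit, which encourages whole groups to be kept or dropped together. Both terms are nonsmooth, which is why a method such as smoothing proximal gradient is a natural choice for the optimization.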
