Automatic instance selection via locality constrained sparse representation for missing value estimation

Missing values in real applications can significantly distort the results of knowledge discovery, so it is vital to estimate these unknown data accurately. This paper applies sparse representation to improve the quality of missing value estimation. First, a novel sparse representation scheme called locality constrained sparse representation (LCSR) is presented, which combines a locality constraint with l1-norm and l2-norm regularization. By taking advantage of sparsity, smoothness and locality structure, LCSR is capable of automatically selecting instances and avoiding overfitting. Then LCSR-based missing value estimation (LCSR-MVE) is proposed: because the reconstruction coefficient vector is sparse, the unobserved values are estimated as a linear combination of atoms that are automatically selected from the dictionary. Three dictionary construction schemes are also developed. The proposed LCSR-MVE is evaluated on 6 datasets from the UCI and gene expression databases and compared with other instance-based missing value estimation methods. Results show that LCSR-MVE outperforms other state-of-the-art methods in terms of normalized root mean squared error (NRMSE), and that it is not very sensitive to the dictionary size or the regularization parameters.
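
The idea summarized above can be illustrated with a minimal sketch. It is not the paper's exact formulation: the abstract does not state the objective, so the code assumes a locality-weighted elastic-net form, 0.5·||x − Dw||² + λ1·Σ dᵢ|wᵢ| + 0.5·λ2·||w||², where dᵢ is the distance from the target to atom i on the observed features. The dictionary is assumed to consist of the complete instances, and the names `locality_sparse_code`, `estimate_missing`, `lam1` and `lam2` are hypothetical. The solver is plain coordinate descent.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the l1 penalty."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def locality_sparse_code(D, x, lam1=0.1, lam2=0.01, n_iter=200):
    """Coordinate descent for an ASSUMED objective (not taken from the paper):
    0.5*||x - D w||^2 + lam1 * sum_i dist_i * |w_i| + 0.5 * lam2 * ||w||^2
    D has shape (n_features, n_atoms); x has shape (n_features,).
    """
    n_atoms = D.shape[1]
    # Locality weights: far atoms pay a larger l1 penalty.
    dist = np.linalg.norm(D - x[:, None], axis=0)
    if dist.max() > 0:
        dist = dist / dist.max()
    col_sq = (D ** 2).sum(axis=0)
    w = np.zeros(n_atoms)
    r = x.astype(float).copy()  # residual x - D w, with w = 0 initially
    for _ in range(n_iter):
        for j in range(n_atoms):
            r += D[:, j] * w[j]          # remove atom j from the fit
            rho = D[:, j] @ r
            w[j] = soft_threshold(rho, lam1 * dist[j]) / (col_sq[j] + lam2)
            r -= D[:, j] * w[j]          # add the updated atom j back
    return w


def estimate_missing(complete_rows, target, lam1=0.1, lam2=0.01):
    """Fill NaN entries of `target` using rows of `complete_rows` as atoms."""
    obs = ~np.isnan(target)
    D_obs = complete_rows[:, obs].T      # atoms restricted to observed features
    D_mis = complete_rows[:, ~obs].T     # same atoms on the missing features
    w = locality_sparse_code(D_obs, target[obs], lam1, lam2)
    filled = target.copy()
    filled[~obs] = D_mis @ w             # linear combination of selected atoms
    return filled, w


complete_rows = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [10.0, -5.0, 7.0, 0.0],
    [-3.0, 8.0, -2.0, 6.0],
])
target = np.array([1.05, 2.05, np.nan, 4.05])
filled, weights = estimate_missing(complete_rows, target)
print(filled)
print(weights)
```

In this toy example the two nearby instances carry most of the weight, while the two distant instances are penalized more heavily by the distance-scaled l1 term, which is one way to read "automatic instance selection". The l2 term keeps the solution stable when nearby atoms are almost collinear.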
