Sparse optimization in feature selection: application in neuroimaging

Feature selection plays an important role in the successful application of machine learning techniques to large real-world datasets. Avoiding model overfitting, especially when the number of features far exceeds the number of observations, requires selecting informative features and/or eliminating irrelevant ones, but searching for an optimal subset of features can be computationally expensive. Functional magnetic resonance imaging (fMRI) produces datasets with these characteristics, which creates challenges for applying machine learning techniques to classify cognitive states based on fMRI data. In this study, we present an embedded feature selection framework that integrates sparse optimization for regularization (or sparse regularization) and classification. This optimization approach attempts to maximize training accuracy while simultaneously enforcing sparsity by adding a penalty on the feature coefficients to the objective function. The penalty drives many coefficients to zero, which effectively eliminates their corresponding features from the classification model. To demonstrate the utility of the approach, we apply our framework to three different real-world fMRI datasets. The results show that regularized classifiers yield better classification accuracy, especially when the number of initial features is large. The results further show that sparse regularization is key to achieving scientifically relevant generalizability and functional localization of classifier features. The approach is thus highly suited for analysis of fMRI data.
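
The way a coefficient penalty removes features can be illustrated with a minimal sketch. It is not the framework used in the study: it assumes an L1 penalty on a logistic regression classifier, fitted by proximal gradient descent (ISTA) on synthetic data in which only the first two of twenty features carry signal. The function names, the penalty weights `lam`, the step size, and the toy data are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=200):
    """Minimize mean logistic loss + lam * sum(|w_j|) by proximal gradient.

    The intercept b is not penalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= step * grad_b / n
        # Gradient step on the loss, then soft-thresholding for the penalty.
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def make_toy_data(n=120, p=20, seed=0):
    """Hypothetical data: the label depends only on features 0 and 1."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if row[0] - row[1] > 0 else 0 for row in X]
    return X, y


X, y = make_toy_data()
for lam in (0.0, 0.1, 10.0):
    w, _ = fit_l1_logistic(X, y, lam)
    kept = [j for j, wj in enumerate(w) if wj != 0.0]
    print(f"lam={lam}: {len(kept)} of {len(w)} features kept")
```

In this sketch the features whose coefficients remain nonzero are the selected ones, so selection and classifier training happen in a single fit. A sufficiently large `lam` removes every feature, which is why the penalty weight would normally be tuned, for example by cross-validation.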
