Fast and Scalable Feature Selection for Gene Expression Data Using Hilbert-Schmidt Independence Criterion

Goal: In computational biology, selecting a small subset of informative genes from microarray data remains a challenge because each dataset contains thousands of genes. This paper aims to quantify the dependence between gene expression data and the response variables, and to identify a subset of the most informative genes using a fast and scalable multivariate algorithm. Methods: A novel algorithm for feature selection from gene expression data was developed. The algorithm is based on the Hilbert-Schmidt independence criterion (HSIC) and was partly motivated by singular value decomposition (SVD). Results: The algorithm is computationally fast and scalable to large datasets. Moreover, it can be applied to problems with any type of response variable, including biclass, multiclass, and continuous response variables. The performance of the proposed algorithm in terms of accuracy, stability of the selected genes, speed, and scalability was evaluated using both synthetic and real-world datasets. The simulation results demonstrated that the proposed algorithm effectively and efficiently extracted stable genes with high predictive capability, in particular for datasets with multiclass response variables. Conclusion/Significance: The proposed method does not require the whole microarray dataset to be stored in memory, and thus can easily be scaled to large datasets. This capability is an important attribute in big data analytics, where data can be large and massively distributed.
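To make the dependence measure concrete, the following is a minimal sketch of HSIC-based gene ranking. It is not the algorithm proposed in the paper (which is multivariate and SVD-motivated); it only illustrates the standard biased empirical HSIC estimate, tr(KHLH)/(n-1)^2, computed one gene at a time. The function names `delta_kernel` and `hsic_scores`, the linear kernel on each gene, and the delta kernel on class labels are all assumptions chosen for illustration.

```python
import numpy as np


def delta_kernel(y):
    """Hypothetical helper: kernel on class labels, 1 if two labels match, else 0."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)


def hsic_scores(X, L):
    """Biased empirical HSIC between each gene (column of X) and the response.

    X : (n_samples, n_genes) expression matrix.
    L : (n_samples, n_samples) kernel matrix on the response variable.
    Assumes a linear kernel on each gene, K_j = x_j x_j^T, so that
    tr(K_j H L H) reduces to x_j^T (H L H) x_j.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    HLH = H @ L @ H                          # centered response kernel
    # Column-wise quadratic form x_j^T (HLH) x_j for all genes at once.
    return np.sum(X * (HLH @ X), axis=0) / (n - 1) ** 2
```

A typical use would be to rank genes by score and keep the top few, for example `top = np.argsort(hsic_scores(X, delta_kernel(y)))[::-1][:k]`. For a continuous response, a linear or Gaussian kernel on `y` could be substituted for the delta kernel. Note that this sketch forms an n-by-n kernel matrix in memory, so it does not reproduce the memory-scalability property claimed for the proposed method.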
