Probabilistic Feature Selection and Classification Vector Machine

Sparse Bayesian learning is a state-of-the-art supervised learning algorithm that can choose a subset of relevant samples from the input data and make reliable probabilistic predictions. However, on high-dimensional data with irrelevant features, traditional sparse Bayesian classifiers suffer from degraded performance and low efficiency because they cannot eliminate irrelevant features. To tackle this problem, we propose a novel sparse Bayesian embedded feature selection algorithm that adopts truncated Gaussian distributions as both sample and feature priors. The proposed algorithm, called the probabilistic feature selection and classification vector machine (PFCVMLP), simultaneously selects relevant features and samples for classification tasks. To obtain analytical solutions, the Laplace approximation is applied to compute approximate posteriors and marginal likelihoods. Parameters and hyperparameters are then optimized by the type-II maximum likelihood method. Experiments on three datasets validate PFCVMLP along two dimensions: classification performance and effectiveness of feature selection. Finally, we analyze the generalization performance and derive a generalization error bound for PFCVMLP. Tightening this bound demonstrates the importance of feature selection.
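As a rough illustration of the two inference steps named in the abstract (a Laplace approximation of the posterior and marginal likelihood, followed by type-II maximum likelihood over a hyperparameter), here is a minimal sketch. It is not the PFCVMLP algorithm: it assumes a single scalar weight, a logistic likelihood, and a plain zero-mean Gaussian prior with precision `alpha` in place of the truncated Gaussian priors, and it replaces the paper's hyperparameter updates with a simple grid search. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    # log(1 + exp(z)) without overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_joint(w, xs, ys, alpha):
    # log p(y | x, w) + log N(w | 0, 1/alpha), with labels y in {0, 1}.
    log_lik = sum(y * w * x - softplus(w * x) for x, y in zip(xs, ys))
    log_prior = 0.5 * math.log(alpha / (2.0 * math.pi)) - 0.5 * alpha * w * w
    return log_lik + log_prior

def neg_hessian(w, xs, alpha):
    # Curvature of the negative log joint at w (always positive here).
    return sum(sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in xs) + alpha

def laplace_fit(xs, ys, alpha, iters=50):
    # Newton iterations to the posterior mode (MAP estimate) of w.
    w = 0.0
    for _ in range(iters):
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys)) - alpha * w
        step = grad / neg_hessian(w, xs, alpha)
        w += step
        if abs(step) < 1e-12:
            break
    h = neg_hessian(w, xs, alpha)
    # Laplace approximation: posterior ~ N(w_map, 1/h), and
    # log evidence ~ log joint at the mode + 0.5*log(2*pi) - 0.5*log(h).
    log_evidence = log_joint(w, xs, ys, alpha) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(h)
    return w, 1.0 / h, log_evidence

def select_alpha(xs, ys, candidates):
    # Type-II maximum likelihood by grid search (an assumption of this sketch):
    # keep the prior precision with the largest approximate log evidence.
    return max(candidates, key=lambda a: laplace_fit(xs, ys, a)[2])

# Hypothetical toy data: one feature, binary labels, not linearly separable.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

if __name__ == "__main__":
    alpha = select_alpha(xs, ys, [0.01, 0.1, 1.0, 10.0])
    w_map, var, log_ev = laplace_fit(xs, ys, alpha)
    print(alpha, w_map, var, log_ev)
```

In a sparse Bayesian model each weight has its own precision, and weights whose optimized precision grows very large are effectively pruned; the sketch keeps one shared `alpha` only to stay short.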
