High-Dimensional Limited-Sample Biomedical Data Classification Using Variational Autoencoder

Biomedical prediction is vital to the modern scientific view of life, but it is a challenging task because the data are high-dimensional, have a limited sample size (together known as the HDLSS problem), are non-linear, and tend to have complex data types. A large number of dimensionality reduction techniques have been developed but, unfortunately, they are not efficient on datasets with a small sample (observation) size. To overcome the pitfalls of sample size and dimensionality, this study employed the variational autoencoder (VAE), which has emerged in recent years as a powerful framework for unsupervised learning. The aim of this study is to investigate a reliable biomedical diagnosis method for HDLSS datasets with minimal error. To evaluate the strength of the proposed model, six genomic microarray datasets from the Kent Ridge Repository were used. In the experiment, several choices of dimensions were selected for data preprocessing. Moreover, to find a stable and suitable classifier, several popular classifiers were applied. The experimental results show that the VAE can provide superior performance compared to traditional methods such as PCA, fastICA, FA, NMF, and LDA.
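
The abstract does not state the training objective. As a point of orientation only, the standard VAE objective in the formulation of Kingma and Welling is the evidence lower bound below. The symbols are assumptions introduced here, not taken from the abstract: x is an expression profile, z is a d-dimensional latent code, the encoder q_φ(z|x) is a diagonal Gaussian with mean μ(x) and variance σ²(x), and the prior p(z) is standard normal.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr),
\qquad
D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
  = \frac{1}{2} \sum_{j=1}^{d}
    \bigl(\mu_j(x)^2 + \sigma_j(x)^2 - \log \sigma_j(x)^2 - 1\bigr)
```

Under these assumptions, a natural reading of VAE-based dimensionality reduction is that the encoder mean μ(x), with d much smaller than the number of genes, serves as the reduced feature vector passed to a classifier. Whether the study uses exactly this choice is not stated in the abstract.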
