A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification

Gene expression data are typically large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although classical principal component analysis (PCA) is widely used as a standard first step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings for data sets with undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach using the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop, for this setting, Akaike's information criterion (AIC), the consistent Akaike's information criterion (CAIC), and Bozdogan's information-theoretic measure of complexity (ICOMP) criterion. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach. Because the hybridized smoothed covariance matrix estimators do not degenerate, they can be used to perform PPCA, to reduce the dimension, and to carry out supervised classification of cancer groups in high dimensions.
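
The degeneracy described above, and the role of a smoothed estimator, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the data are synthetic Gaussian noise, `smoothed_covariance` is a generic convex combination of the sample covariance with its own diagonal (a stand-in, not the maximum entropy or hybridized estimators), and the smoothed eigenvalues are plugged into the standard Tipping and Bishop PPCA log-likelihood to score each candidate number of components with AIC and CAIC. ICOMP is omitted. The function names and the value `alpha=0.5` are hypothetical.

```python
import numpy as np


def sample_covariance(X):
    """Sample covariance (divisor n) of an n x p data matrix."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]


def smoothed_covariance(S, alpha=0.5):
    """Illustrative stand-in: shrink S toward its own diagonal.

    This is NOT the paper's estimator; alpha is a hypothetical tuning value.
    """
    return (1.0 - alpha) * S + alpha * np.diag(np.diag(S))


def ppca_loglik(eigvals, n, q):
    """Maximized PPCA log-likelihood for q components, given eigenvalues
    sorted in decreasing order (standard Tipping and Bishop form)."""
    p = eigvals.size
    sigma2 = eigvals[q:].mean()  # noise variance: mean of discarded eigenvalues
    return -0.5 * n * (
        p * np.log(2.0 * np.pi)
        + np.log(eigvals[:q]).sum()
        + (p - q) * np.log(sigma2)
        + p
    )


def select_q(eigvals, n, q_max):
    """Return the q in 1..q_max minimizing AIC and CAIC, respectively."""
    p = eigvals.size
    scores = {"AIC": [], "CAIC": []}
    for q in range(1, q_max + 1):
        # Free parameters in the loading matrix and noise variance; the mean
        # adds p more for every q, so it does not change the ranking.
        m = p * q - q * (q - 1) // 2 + 1
        ll = ppca_loglik(eigvals, n, q)
        scores["AIC"].append(-2.0 * ll + 2.0 * m)
        scores["CAIC"].append(-2.0 * ll + m * (np.log(n) + 1.0))
    return {name: int(np.argmin(vals)) + 1 for name, vals in scores.items()}


# Synthetic undersized data: n = 20 samples, p = 100 features (assumed sizes).
rng = np.random.default_rng(0)
n, p, q_max = 20, 100, 10
X = rng.standard_normal((n, p))

S = sample_covariance(X)
S_smooth = smoothed_covariance(S, alpha=0.5)

print("rank of sample covariance:", np.linalg.matrix_rank(S), "of", p)
print("smallest eigenvalue after smoothing:", np.linalg.eigvalsh(S_smooth).min())

# Eigenvalues of the smoothed matrix, largest first, used as plug-in values.
eigvals = np.sort(np.linalg.eigvalsh(S_smooth))[::-1]
chosen = select_q(eigvals, n, q_max)
print("selected number of components:", chosen)
```

With n < p the centered data matrix has rank at most n - 1, so the sample covariance cannot be inverted, whereas adding a positive diagonal term makes every eigenvalue strictly positive. That is the property the log-likelihood and the criteria above rely on.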
