Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene and Protein Expression Studies

The recent explosion in procurement and availability of high-dimensional gene and protein expression profile data sets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional data sets stems from the "curse of dimensionality," which arises when the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a single dimensionality reduction (DR) scheme, principal component analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene and protein expression studies. Toward this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, and Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, and Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
Owing to the inherent nonlinear structure of gene and protein expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data compared to the linear DR schemes. The DR schemes were evaluated by 1) assessing how well two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) discriminate between classes in the different low-dimensional data embeddings and 2) applying five cluster validity measures to evaluate the size, distance, and tightness of object aggregates in the low-dimensional space. For each of the seven evaluation measures considered, a statistically significant improvement in the quality of the embeddings was observed across 10 cancer data sets when the three nonlinear DR schemes were used in place of the three linear DR techniques. Similar trends were observed when linear and nonlinear DR were applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embeddings obtained via the six DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g., cancer subtypes) within the data.
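
As a minimal sketch of how a cluster validity measure can score an embedding, the snippet below computes the Davies-Bouldin index, which combines within-class tightness and between-class distance (lower values indicate tighter, better-separated classes). The abstract does not name the five measures used, so the choice of this index is an assumption made for illustration. The function names and the two toy embeddings are likewise hypothetical and are not data from the study.

```python
import math
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin(embedding, labels):
    """Davies-Bouldin index of a labelled embedding (lower is better).

    Illustrative choice of measure; not necessarily one used in the paper.
    """
    groups = defaultdict(list)
    for point, label in zip(embedding, labels):
        groups[label].append(point)
    if len(groups) < 2:
        raise ValueError("need at least two classes")

    centers = {k: centroid(v) for k, v in groups.items()}
    # Scatter: mean distance from each class's points to its own centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in v) / len(v)
        for k, v in groups.items()
    }
    # For each class, take its worst (largest) similarity ratio to any other class.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in groups
            if j != i
        )
        for i in groups
    ]
    return sum(worst) / len(worst)


# Hypothetical 2-D embeddings of the same four samples from two DR schemes.
labels = ["tumor", "tumor", "normal", "normal"]
embedding_a = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]  # classes far apart
embedding_b = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]    # classes close together

score_a = davies_bouldin(embedding_a, labels)
score_b = davies_bouldin(embedding_b, labels)
print(score_a, score_b)
```

Under this measure, the embedding with the lower score would be judged to separate the classes more cleanly. A comparison of DR schemes in this spirit would compute such a score for each scheme's embedding of the same data set.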
