Functional embedding for the classification of gene expression profiles

MOTIVATION: Data with a small sample size n and a high dimension p, where n << p, are commonly encountered in genomics and statistical genetics. For such data the variance-covariance matrix is ill-conditioned, which makes traditional multivariate analysis approaches unattractive. Functional data analysis (FDA) approaches, on the other hand, are designed for infinite-dimensional data and may therefore be useful for the analysis of large-p data. We propose a functional embedding (FEM) technique that exploits the interface between multivariate and functional data. It aims to borrow strength across the sample through FDA techniques in order to resolve the difficulties caused by the high dimension p.

RESULTS: Pairwise dissimilarities among the predictor variables are used to obtain a univariate configuration of these covariates. This configuration is interpreted as a variable ordination that defines the domain of a suitable function space, and thus leads to the FEM of the high-dimensional data. As an example of downstream analysis, the embedding may then be followed by functional logistic regression for the classification of high-dimensional multivariate data. The resulting functional classification is evaluated on several published gene expression array datasets and one mass spectrometric dataset, and is shown to compare favorably with various methods previously employed for the classification of these high-dimensional gene expression profiles.
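The embedding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the correlation-based dissimilarity, the use of classical (Torgerson) scaling for the one-dimensional configuration, the Gaussian kernel smoother, the bandwidth, and the function names `functional_embedding` and `smooth_profiles` are all assumptions made for illustration, and the data are simulated.

```python
import numpy as np

def functional_embedding(X):
    """Map the p columns of X (n samples x p variables) to 1-D coordinates.

    Assumption: squared dissimilarity between variables j and k is
    2 * (1 - corr_jk); the abstract does not specify the dissimilarity.
    """
    R = np.corrcoef(X, rowvar=False)           # p x p correlation matrix
    D2 = 2.0 * (1.0 - R)                       # squared dissimilarities
    p = D2.shape[0]
    J = np.eye(p) - np.ones((p, p)) / p        # centering matrix
    B = -0.5 * J @ D2 @ J                      # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    coords = evecs[:, -1] * np.sqrt(max(evals[-1], 0.0))  # leading 1-D axis
    order = np.argsort(coords)                 # variable ordination
    return coords, order

def smooth_profiles(X, coords, order, rel_bandwidth=0.1):
    """Treat each reordered sample as a curve over coords and kernel-smooth it."""
    t = coords[order]                          # ordered domain points
    curves = X[:, order]                       # one curve per sample
    h = rel_bandwidth * (t.max() - t.min())    # hypothetical bandwidth choice
    W = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    W /= W.sum(axis=1, keepdims=True)          # Nadaraya-Watson weights
    return curves @ W.T                        # smoothed curves, n x p

# Simulated n << p data standing in for expression profiles
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
coords, order = functional_embedding(X)
smoothed = smooth_profiles(X, coords, order)
```

The smoothed curves are the kind of functional predictors that a functional logistic regression could take as input. Smoothing is what makes the ordination matter here: variables that land close together on the embedded domain are averaged with each other, which is one way of borrowing strength across correlated predictors.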
