Classification Using Generalized Partial Least Squares

Advances in computational biology have made it possible to monitor thousands of features simultaneously. High-throughput technologies provide a much richer context in which to study various aspects of gene function, but they also pose the challenge of analyzing data with a large number of covariates and few samples. Classification of samples into two or more categories, a core task in machine learning, is almost always of interest to scientists. We address classification in this setting by extending partial least squares (PLS), a dimension reduction tool popular in chemometrics, to generalized linear regression, building on an earlier approach, iteratively reweighted partial least squares (IRWPLS). We compare our results with two-stage PLS and with other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying Firth's procedure to avoid (quasi)separation, we often obtain lower classification error rates.
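As a rough illustration of why Firth's procedure helps under separation, the Python sketch below fits a logistic regression using Firth's modified score on a tiny, perfectly separated toy data set. It is not the authors' implementation and omits the PLS dimension reduction entirely; the function name `firth_logistic`, its parameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def firth_logistic(X, y, n_iter=200, tol=1e-8):
    """Logistic regression with Firth's bias-reducing modified score.

    Hypothetical helper for illustration; X must already contain an
    intercept column if one is wanted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = pi * (1.0 - pi)                      # IRLS weights
        info = X.T @ (w[:, None] * X)            # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: X'(y - pi + h * (1/2 - pi))
        score = X.T @ (y - pi + h * (0.5 - pi))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data: the classes are perfectly separated at x = 0, so the ordinary
# maximum likelihood slope would diverge to infinity.
x_demo = np.array([-2.0, -1.0, 1.0, 2.0])
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

beta_demo = firth_logistic(X_demo, y_demo)
print("Firth estimate (intercept, slope):", beta_demo)
```

On data like these, ordinary maximum likelihood has no finite solution, whereas the penalized estimate stays finite. A generalized PLS procedure would apply the same kind of correction inside its iteratively reweighted steps, on a reduced set of components rather than on the raw covariates.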
