Kernelized partial least squares for feature reduction and classification of gene microarray data

Background
The primary objectives of this paper are: (1) to apply Statistical Learning Theory (SLT), specifically Partial Least Squares (PLS) and Kernelized PLS (K-PLS), to the universal "feature-rich/case-poor" (also known as "large p, small n" or "high-dimension, low-sample-size") microarray problem by eliminating those features (or probes) that do not contribute to the "best" chromosome bio-markers for lung cancer, and (2) to quantitatively measure and verify, by independent means, the efficacy of this PLS process. A secondary objective is to integrate these significant improvements in diagnostic and prognostic biomedical applications into the clinical research arena, that is, to devise a framework for converting SLT results into direct, useful clinical information for patient care or pharmaceutical research. We therefore propose, and preliminarily evaluate, a process whereby PLS, K-PLS, and Support Vector Machines (SVM) may be integrated with the accepted and well-understood traditional biostatistical "gold standard" methods: the Cox Proportional Hazards model and Kaplan-Meier survival analysis. Specifically, this new combination is illustrated first with PLS and Kaplan-Meier, then with PLS and Cox Hazard Ratios (CHR), and can be easily extended to both the K-PLS and SVM paradigms. Finally, these processes are contained in the Fine Feature Selection (FFS) component of our overall feature reduction/evaluation process, which consists of the following components: (1) coarse feature reduction, (2) fine feature selection, and (3) classification (as described in this paper) and prediction.

Results
Our results for PLS and K-PLS showed that these techniques, as part of our overall feature reduction process, performed well on noisy microarray data. The best performance was an Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.794 for classification of recurrence prior to or after 36 months, and a strong AUC of 0.869 for classification of recurrence prior to or after 60 months. Kaplan-Meier curves for the classification groups were clearly separated, with p-values below 4.5e-12 for both 36 and 60 months. CHRs were also good, with ratios of 2.846341 (36 months) and 3.996732 (60 months).

Conclusions
SLT techniques such as PLS and K-PLS can effectively address difficult problems in analyzing biomedical data such as microarrays. The combinations with established biostatistical techniques demonstrated in this paper allow these methods to move from academic research into clinical practice.
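
The two evaluation measures reported above, AUC and Kaplan-Meier survival, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `auc` and `kaplan_meier`, the toy scores, and the toy follow-up times are all assumptions made for illustration. The AUC is computed through the Mann-Whitney rank statistic, and the survival curve through the standard product-limit formula.

```python
from collections import Counter


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels: 1 = positive class (e.g. recurrence before the cutoff), 0 = negative.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times: follow-up time per patient; events: 1 = event observed, 0 = censored.
    Returns a list of (time, survival probability) at each distinct event time.
    """
    observed = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy data, invented for illustration only (not from the paper).
toy_scores = [0.9, 0.4, 0.6, 0.1]   # hypothetical classifier outputs
toy_labels = [1, 1, 0, 0]           # 1 = recurrence before the cutoff
print(auc(toy_scores, toy_labels))  # 0.75

toy_months = [6, 12, 12, 24, 36]    # hypothetical follow-up in months
toy_events = [1, 1, 0, 1, 0]        # 0 = censored
print(kaplan_meier(toy_months, toy_events))
```

In a workflow like the one described, one such curve would be estimated per predicted class and the two curves compared, for example with a log-rank test; that comparison step is not shown here.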
