Improving pattern classification of DNA microarray data by using PCA and logistic regression

DNA microarrays are a technology that can be used to diagnose cancer and other diseases. To automate the analysis of such data, pattern recognition and machine learning algorithms can be applied. However, the curse of dimensionality is unavoidable: there are very few samples to train on and many attributes in each sample. Because the predictive accuracy of supervised classifiers decays in the presence of irrelevant and redundant features, a dimensionality reduction process is essential. The main idea is to retain only the genes that are most influential in the classification of the disease. In this paper, a new methodology based on Principal Component Analysis and Logistic Regression is proposed. Our method enables the selection of particular genes that are relevant for classification. Experiments were run using eight different classifiers on two benchmark datasets: Leukemia and Lymphoma. The results show that our method not only reduces the number of required attributes, but also increases the classification accuracy by more than 10% in all the cases we tested.
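The abstract does not spell out how Principal Component Analysis and logistic regression are combined, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes that genes are ranked by the absolute value of their loading on the first principal component, that the top k genes are kept, and that a plain logistic regression is then fitted on those genes. All function names (`make_toy_data`, `first_pc`, `select_genes`, `fit_logistic`, `predict`) are hypothetical, and the data is a small synthetic stand-in rather than the Leukemia or Lymphoma datasets.

```python
import math
import random


def make_toy_data(n=20, p=50, informative=(0, 1, 2), seed=0):
    """Synthetic 'few samples, many genes' data; only `informative` genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 0.5) for _ in range(p)]
        for j in informative:
            row[j] += 3.0 if label == 1 else -3.0
        X.append(row)
        y.append(label)
    return X, y


def center(X):
    """Subtract the per-gene mean from every sample."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def first_pc(Xc, iters=200, seed=1):
    """Leading principal axis of centered data, by power iteration on Xc^T Xc."""
    n, p = len(Xc), len(Xc[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]        # Xc v
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]  # Xc^T (Xc v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def select_genes(X, k):
    """Assumed selection rule: keep the k genes with the largest |loading| on PC1."""
    v = first_pc(center(X))
    ranked = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)
    return sorted(ranked[:k])


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Unregularized logistic regression fitted by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(X, w, b):
    """Predict class 1 when the fitted probability is at least 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


if __name__ == "__main__":
    X, y = make_toy_data()
    genes = select_genes(X, k=3)
    X_sel = [[row[j] for j in genes] for row in X]
    w, b = fit_logistic(X_sel, y)
    acc = sum(int(a == t) for a, t in zip(predict(X_sel, w, b), y)) / len(y)
    print("selected genes:", genes, "training accuracy:", acc)
```

The accuracy printed here is measured on the training samples of a toy problem and says nothing about the results reported in the paper; a faithful evaluation would use held-out data or cross-validation on the benchmark datasets.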
