Pattern Analysis in DNA Microarray Data through PCA-Based Gene Selection

DNA microarray technology can be used to diagnose cancer and other diseases. To automate the analysis of such data, pattern recognition and machine learning algorithms can be applied. However, the curse of dimensionality is unavoidable: there are very few training samples and many attributes in each sample. Because the predictive accuracy of supervised classifiers decays in the presence of irrelevant and redundant features, a dimensionality reduction process is essential. In this paper, we propose a new methodology based on Principal Component Analysis and other statistical tools to gain insight into the identification of relevant genes. We evaluate the methodology on two benchmark datasets: Leukemia and Lymphoma. The results show that it is possible to considerably reduce the number of genes while increasing the performance of well-known classifiers.
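The abstract does not spell out how Principal Component Analysis is turned into a gene ranking, so the following is only an illustrative sketch of one common approach, not the authors' procedure. It assumes genes are scored by their squared loadings on the leading principal components, weighted by each component's variance. The function name `rank_genes_by_pca`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def rank_genes_by_pca(X, n_components=3, top_k=10):
    """Return indices of the top_k genes with the largest PCA loading scores.

    X is a (samples x genes) expression matrix. The scoring rule below is an
    assumption made for illustration, not the method from the paper.
    """
    Xc = X - X.mean(axis=0)  # center each gene (column)
    # SVD of the centered data: rows of Vt are the principal axes (loadings).
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = (s ** 2) / (X.shape[0] - 1)  # variance captured by each component
    k = min(n_components, Vt.shape[0])
    # Score per gene: variance-weighted sum of squared loadings over k components.
    scores = (var[:k, None] * Vt[:k] ** 2).sum(axis=0)
    return np.argsort(scores)[::-1][:top_k]

# Synthetic example: 40 samples, 200 genes, where genes 0..4 share a strong
# latent signal and the remaining genes are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
z = rng.normal(size=40)
X[:, :5] += 5 * z[:, None]

selected = rank_genes_by_pca(X, n_components=3, top_k=5)
print(sorted(selected.tolist()))
```

In this toy setting there are far more genes than samples, mirroring the few-samples, many-attributes situation described above. A reduced matrix such as `X[:, selected]` could then be passed to any classifier in place of the full gene set.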
