Applying Data Mining Techniques for Cancer Classification from Gene Expression Data

Recent studies on molecular-level classification of tissues have produced remarkable results and indicate that gene expression assays could significantly aid the development of efficient cancer diagnosis and classification platforms. However, cancer classification based on DNA array data remains a difficult problem. The main challenge is that the number of genes far exceeds the number of training samples, which makes accurate classification harder. This paper applies a genetic algorithm (GA) whose initial solution is provided by t-statistics (t-GA) to select a group of relevant genes from cancer microarray data. A decision-tree-based cancer classifier is then built on the selected genes. The approach is evaluated by comparing it with other gene selection methods on publicly available gene expression datasets. Experimental results indicate that t-GA achieves the highest accuracy among the methods compared. The Z-score figure also shows that the gene selection performed by t-GA is reproducible.
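The idea of seeding a genetic algorithm with t-statistics can be illustrated with a minimal sketch. It is not the paper's implementation: the Welch form of the t-statistic, the population size, the crossover and mutation operators, the toy data, and all function names (`t_statistic`, `rank_genes`, `loo_accuracy`, `t_ga`, `make_toy_data`) are assumptions made for illustration. To keep the example dependency-free, a leave-one-out nearest-centroid classifier stands in for the decision tree as the fitness function.

```python
import math
import random

def t_statistic(a, b):
    """Welch two-sample t-statistic for one gene (assumed form)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / denom if denom > 0 else 0.0

def rank_genes(X, y):
    """Gene indices sorted by decreasing |t| between class 0 and class 1."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(t_statistic(a, b)))
    return sorted(range(len(scores)), key=lambda g: -scores[g])

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier.
    Stand-in fitness; the paper uses a decision tree instead."""
    correct = 0
    for i in range(len(X)):
        dist = {}
        for lab in (0, 1):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == lab]
            cent = [sum(r[g] for r in rows) / len(rows) for g in genes]
            dist[lab] = sum((X[i][g] - c) ** 2 for g, c in zip(genes, cent))
        pred = 0 if dist[0] <= dist[1] else 1
        correct += int(pred == y[i])
    return correct / len(X)

def t_ga(X, y, k=3, pop_size=10, generations=15, seed=0):
    """GA over k-gene subsets; one initial individual is the top-k by |t|."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    ranked = rank_genes(X, y)
    pop = [tuple(sorted(ranked[:k]))]            # t-statistic seed
    while len(pop) < pop_size:                   # remaining individuals random
        pop.append(tuple(sorted(rng.sample(range(n_genes), k))))

    def fitness(ind):
        return loo_accuracy(X, y, ind)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            pool = sorted(set(p1) | set(p2))     # crossover: draw from the union
            child = set(rng.sample(pool, k))
            if rng.random() < 0.2:               # mutation: swap one gene
                child.discard(rng.choice(sorted(child)))
                while len(child) < k:
                    child.add(rng.randrange(n_genes))
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

def make_toy_data(n_per_class=6, n_genes=20, seed=1):
    """Synthetic data (not from the paper): gene 0 separates the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for i in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] = 10.0 * label + 0.1 * (i % 3)
            X.append(row)
            y.append(label)
    return X, y

X, y = make_toy_data()
best_genes, best_acc = t_ga(X, y)
print(best_genes, best_acc)
```

Because the best half of each generation is carried over unchanged, the final subset in this sketch never scores worse than the t-statistic seed it started from. A real experiment would replace the toy data with a microarray dataset and the stand-in fitness with the decision-tree accuracy.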
