Data mining and genetic algorithm based gene/SNP selection

OBJECTIVE: Genomic studies produce large volumes of data, with the number of single nucleotide polymorphisms (SNPs) ranging into the thousands. Analyzing SNPs makes it possible to relate genotypic and phenotypic information and to identify SNPs associated with a disease. The growing wealth of information and advances in biology call for new approaches to knowledge discovery. One such area is the identification of gene/SNP patterns that affect the development of cures and drugs for various diseases.

METHODS: A new approach for predicting drug effectiveness is presented. The approach is based on data mining and genetic algorithms. A global search mechanism, a weighted decision tree, a decision-tree-based wrapper, a correlation-based heuristic, and the identification of intersecting feature sets are employed to select significant genes.

RESULTS: The feature selection approach reduced the number of features by 85%. For the significant gene/SNP set, the relative increases in cross-validation accuracy and specificity were 10% and 3.2%, respectively.

CONCLUSION: The feature selection approach was successfully applied to data sets for drug and placebo subjects. The number of features was significantly reduced while the quality of the extracted knowledge was enhanced. The feature set intersection approach provided the most significant genes/SNPs. The reported results describe associations among SNPs that can lead to patient-specific treatment protocols.
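The idea of intersecting feature sets named in METHODS can be illustrated with a small sketch. This is not the authors' implementation: the genetic-algorithm search and the weighted decision tree are omitted, a one-level decision tree (a "stump") stands in for the decision-tree-based wrapper, and a plain Pearson correlation stands in for the correlation-based heuristic. The SNP names, genotype codes (0/1/2), response labels, and thresholds are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlation_filter(data, labels, threshold):
    """Keep features whose absolute correlation with the label meets the threshold."""
    return {name for name, col in data.items()
            if abs(pearson(col, labels)) >= threshold}

def stump_accuracy(col, labels):
    """Training accuracy of a one-level tree: predict the majority label per genotype value."""
    groups = {}
    for value, label in zip(col, labels):
        groups.setdefault(value, []).append(label)
    correct = sum(max(ys.count(0), ys.count(1)) for ys in groups.values())
    return correct / len(labels)

def stump_wrapper(data, labels, min_accuracy):
    """Keep features on which the one-level tree reaches the required accuracy."""
    return {name for name, col in data.items()
            if stump_accuracy(col, labels) >= min_accuracy}

# Hypothetical data: 8 subjects, genotypes coded as minor-allele counts.
labels = [0, 0, 0, 0, 1, 1, 1, 1]          # 1 = responded to the drug (assumed coding)
data = {
    "snp_a": [0, 0, 0, 1, 2, 2, 1, 2],     # roughly linear in the label
    "snp_b": [1, 1, 1, 1, 1, 1, 1, 1],     # constant, uninformative
    "snp_c": [0, 2, 0, 2, 1, 1, 1, 1],     # informative but not linearly correlated
    "snp_d": [0, 1, 0, 1, 1, 2, 1, 2],     # correlated, but a weak single split
}

corr_set = correlation_filter(data, labels, threshold=0.7)
tree_set = stump_wrapper(data, labels, min_accuracy=0.8)
selected = corr_set & tree_set              # features both selectors agree on
reduction = 1 - len(selected) / len(data)   # fraction of features removed

print(sorted(corr_set), sorted(tree_set), sorted(selected), reduction)
```

On this toy data the two selectors disagree on `snp_c` and `snp_d`, so only the feature supported by both survives the intersection. That is the motivation for intersecting: a feature chosen by several independent criteria is less likely to be an artifact of any one of them.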
