Data mining, neural nets, trees — Problems 2 and 3 of Genetic Analysis Workshop 15

Genome‐wide association studies using thousands to hundreds of thousands of single nucleotide polymorphism (SNP) markers, and region‐wide association studies using a dense panel of SNPs, are already in use to identify disease susceptibility genes and to predict disease risk in individuals. Because these tasks are becoming increasingly important, three different data sets were provided for Genetic Analysis Workshop 15, allowing examination of various novel and existing data mining methods for classification and for identification of disease susceptibility genes and of gene‐by‐gene or gene‐by‐environment interactions. The approach most often applied in this presentation group was random forests, chosen for its simplicity, elegance, and robustness. It was used both for prediction and, as a first step, for screening for interesting SNPs. The logistic tree with unbiased selection approach appeared to be an interesting alternative for efficiently selecting interesting SNPs. Machine learning methods, specifically ensemble methods, might be useful as pre‐screening tools for large‐scale association studies because, compared with standard statistical approaches, they can be less prone to overfitting, can require less computer processor time, and can easily include pair‐wise and higher‐order interactions, while also having a high capability for classification. However, improved implementations that are able to deal with hundreds of thousands of SNPs at a time are required. Genet. Epidemiol. 31(Suppl. 1):S51–S60, 2007. © 2007 Wiley‐Liss, Inc.
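
To make the idea of ensemble-based SNP pre-screening concrete, the following is a minimal sketch only. It is not the random forests algorithm itself and not any analysis from the workshop contributions: it bags single-split trees ("stumps") over random SNP subsets and ranks SNPs by their summed Gini decrease. The data are simulated rather than GAW15 data, and all names (`simulate`, `screen_snps`, `best_split`, `mtry`, `n_stumps`) as well as the risk model are illustrative assumptions.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, snp):
    """Largest Gini decrease over the two possible cuts of a 0/1/2 genotype."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in (0, 1):  # genotype <= threshold versus > threshold
        left = [y for x, y in zip(rows, labels) if x[snp] <= threshold]
        right = [y for x, y in zip(rows, labels) if x[snp] > threshold]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

def screen_snps(rows, labels, n_stumps=200, mtry=3, seed=0):
    """Sum the Gini decrease of the winning SNP across bagged stumps."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    importance = [0.0] * p
    for _ in range(n_stumps):
        # Bootstrap sample of individuals (drawn with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        b_rows = [rows[i] for i in idx]
        b_labels = [labels[i] for i in idx]
        # Random subset of candidate SNPs for this stump.
        candidates = rng.sample(range(p), mtry)
        gains = {s: best_split(b_rows, b_labels, s) for s in candidates}
        winner = max(gains, key=gains.get)
        importance[winner] += gains[winner]
    return importance

def simulate(n=300, p=10, causal=0, seed=1):
    """Hypothetical case-control data: one causal SNP, the rest noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        genotype = [rng.choice((0, 1, 2)) for _ in range(p)]
        risk = 0.1 + 0.4 * genotype[causal]  # assumed risks: 0.1, 0.5, 0.9
        labels.append(1 if rng.random() < risk else 0)
        rows.append(genotype)
    return rows, labels

rows, labels = simulate()
importance = screen_snps(rows, labels)
# SNP indices ordered from most to least important.
ranking = sorted(range(len(importance)), key=lambda s: importance[s], reverse=True)
print(ranking[:3])
```

In a two-stage design of the kind described above, only the top-ranked SNPs from such a screen would be carried forward to a more detailed model. A full random forest grows deep trees and typically also reports permutation-based importance, which this sketch omits.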
