Statistical Methods of SNP Data Analysis and Applications

We develop statistical methods for the analysis of multidimensional genetic data and establish theorems that justify their application. We concentrate on multifactor dimensionality reduction, logic regression, random forests, and stochastic gradient boosting, along with new modifications of these methods. We use complementary approaches to study the risk of complex diseases, such as cardiovascular diseases. We examine the roles of certain combinations of single nucleotide polymorphisms (SNPs) and non-genetic risk factors. The analysis of data on coronary heart disease and myocardial infarction was performed on the Lomonosov Moscow State University supercomputer “Chebyshev”.
