An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions

MOTIVATION: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models.

RESULTS: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit, and is determined by, the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way interaction, involving fewer than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods.

AVAILABILITY: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm
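To make aspects (ii) and (iv) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes genotypes are stored as minor-allele counts (0, 1 or 2), treats a k-way interaction as a conjunction of dominant- or recessive-coded SNPs, and scores one candidate interaction with the standard Schwarz form of BIC against a phenotype-prevalence-only model. The function names (`code_snp`, `interaction_feature`, `bic_gain`) and the toy data are hypothetical, and the exact criterion and search used by MECPM may differ.

```python
import math

def code_snp(genotype, mode):
    """Binary coding of a minor-allele count (0, 1 or 2). Assumed encoding."""
    if mode == "dominant":
        return int(genotype >= 1)   # at least one minor allele
    if mode == "recessive":
        return int(genotype == 2)   # two minor alleles
    raise ValueError(f"unknown coding mode: {mode}")

def interaction_feature(genotypes, terms):
    """1 if every (snp_index, mode) term in the interaction is 'on', else 0."""
    return int(all(code_snp(genotypes[i], mode) for i, mode in terms))

def binomial_loglik(cases, total):
    """Maximized log-likelihood of `cases` successes out of `total` trials."""
    if cases == 0 or cases == total:
        return 0.0
    p = cases / total
    return cases * math.log(p) + (total - cases) * math.log(1.0 - p)

def bic(loglik, n_params, n_samples):
    """Schwarz criterion: lower is better."""
    return -2.0 * loglik + n_params * math.log(n_samples)

def bic_gain(samples, labels, terms):
    """BIC(prevalence-only model) minus BIC(model with one interaction feature).

    With a single binary feature, the maximum-likelihood conditional model
    simply matches the case fraction within each feature value, so no
    iterative fitting is needed in this toy setting.
    """
    n = len(labels)
    null_ll = binomial_loglik(sum(labels), n)
    feats = [interaction_feature(g, terms) for g in samples]
    model_ll = 0.0
    for value in (0, 1):
        group = [y for y, f in zip(labels, feats) if f == value]
        if group:
            model_ll += binomial_loglik(sum(group), len(group))
    return bic(null_ll, 1, n) - bic(model_ll, 2, n)

# Toy data: phenotype is 1 only when SNP 0 is recessive-on and SNP 1 is dominant-on.
samples = [(2, 1)] * 10 + [(0, 0)] * 10 + [(2, 0)] * 10 + [(1, 1)] * 10
labels = [1] * 10 + [0] * 30
true_terms = [(0, "recessive"), (1, "dominant")]
partial_terms = [(1, "dominant")]

gain_true = bic_gain(samples, labels, true_terms)
gain_partial = bic_gain(samples, labels, partial_terms)
print(gain_true > gain_partial > 0)
```

A positive gain means the extra parameter pays for itself under BIC. In this toy data the two-way interaction scores higher than the one-way term it contains, which illustrates how a single criterion can compare interactions of different orders.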
