Paper information - An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions

MOTIVATION: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers and interactions and for building phenotype-predictive models.

RESULTS: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit the interactions that confer phenotype-predictive power, and the model is determined by those interactions. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way interaction, together involving fewer than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods.

AVAILABILITY: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm
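To make points (ii) and (iv) of the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes genotypes are stored as minor-allele counts (0, 1 or 2), that an interaction is "on" only when every SNP in it is "on" under its own dominant or recessive coding, and that BIC takes the textbook form -2 ln L + k ln n. The count-based likelihood stands in for the maximum entropy fit, and all function names and the toy data are hypothetical.

```python
import math


def code_snp(genotype, mode):
    """Binary-code a genotype given as a minor-allele count (0, 1 or 2)."""
    if mode == "dominant":      # at least one copy of the minor allele
        return int(genotype >= 1)
    if mode == "recessive":     # two copies of the minor allele
        return int(genotype == 2)
    raise ValueError(f"unknown coding: {mode}")


def interaction_feature(genotypes, snp_modes):
    """Return 1 if every SNP in the interaction is 'on' under its coding.

    snp_modes maps a SNP index to "dominant" or "recessive", so each SNP
    in the same interaction can use a different coding.
    """
    return int(all(code_snp(genotypes[i], m) for i, m in snp_modes.items()))


def conditional_log_likelihood(features, labels):
    """Log-likelihood of the labels when P(y | f) is estimated by counts.

    This is a simple stand-in for a fitted conditional model, used only
    to have a likelihood to plug into BIC.
    """
    counts = {}
    for f, y in zip(features, labels):
        counts.setdefault(f, [0, 0])[y] += 1
    ll = 0.0
    for group in counts.values():
        n = sum(group)
        ll += sum(c * math.log(c / n) for c in group if c > 0)
    return ll


def bic(log_likelihood, n_params, n_samples):
    """Textbook BIC (lower is better): -2 ln L + k ln n."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


# Hypothetical toy data: two SNPs per sample, label 1 = case, 0 = control.
samples = [(2, 1), (2, 2), (2, 1), (0, 0), (1, 0), (0, 2), (2, 0), (0, 1)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Candidate two-way interaction: SNP 0 recessive AND SNP 1 dominant.
modes = {0: "recessive", 1: "dominant"}
features = [interaction_feature(g, modes) for g in samples]

n = len(labels)
# Null model: class prior only (one free parameter).
bic_null = bic(conditional_log_likelihood([0] * n, labels), 1, n)
# Model with the interaction: one probability per feature value (two parameters).
bic_feature = bic(conditional_log_likelihood(features, labels), 2, n)

print("BIC (null):", bic_null)
print("BIC (with interaction):", bic_feature)
```

On this toy data the interaction model has the lower BIC, so a search guided by this criterion would keep the interaction. The log-likelihood term rewards fit and the k ln n term charges for each added parameter, which is what allows candidates of different orders to be compared on one scale.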
