Sparse Partitioning: Nonlinear regression with binary or tertiary predictors, with application to association studies

This paper presents Sparse Partitioning, a Bayesian method for identifying predictors that affect a response variable, either individually or in combination with others. The method is designed for regression problems involving binary or tertiary predictors and allows the number of predictors to exceed the sample size, two properties that make it well suited for association studies. Sparse Partitioning differs from other regression methods by placing no restrictions on how the predictors may influence the response. To compensate for this generality, Sparse Partitioning implements a novel way of exploring the model space: it searches for high posterior probability partitions of the predictor set, where each partition defines groups of predictors that jointly influence the response. The result is a robust method that requires no prior knowledge of the true predictor--response relationship. Testing on simulated data suggests that Sparse Partitioning will typically match the performance of an existing method on a data set that obeys the existing method's model assumptions. When these assumptions are violated, Sparse Partitioning will generally offer superior performance.
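To make the idea of predictors that "jointly influence the response" concrete, the following toy sketch may help. It is not the paper's algorithm: there is no prior, no posterior, and no search over partitions. The helper `cell_means`, the example `partition`, and the XOR data are all assumptions introduced here for illustration. The sketch only shows why grouping predictors can reveal an effect that no single predictor shows on its own.

```python
from collections import defaultdict
from itertools import product


def cell_means(X, y, group):
    """Mean response for each observed joint configuration of the
    predictors whose column indices are listed in `group`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, resp in zip(X, y):
        key = tuple(row[j] for j in group)
        sums[key] += resp
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Toy data (assumed): three binary predictors, all 8 configurations.
# The response is the XOR of predictors 0 and 1; predictor 2 is irrelevant.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [float(row[0] ^ row[1]) for row in X]

# A hypothetical partition of the predictor set: predictors 0 and 1 form
# one group that acts jointly, and predictor 2 sits in a "no effect" group.
partition = {"joint": (0, 1), "null": (2,)}

# Looking at predictor 0 alone, the mean response is the same for both values.
single0 = cell_means(X, y, (0,))

# Looking at the group (0, 1) jointly, the response is fully explained,
# with no assumption about the functional form of the relationship.
joint = cell_means(X, y, partition["joint"])

print("predictor 0 alone:", single0)
print("predictors 0 and 1 jointly:", joint)
```

In this example the marginal view of predictor 0 is flat, while the per-configuration means for the group (0, 1) recover the XOR pattern exactly, which is the kind of unrestricted joint effect the abstract describes.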
