Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms

This manuscript relates two approaches for exploring complex dependence structures between categorical variables: Bayesian partitioning of the covariate space, incorporating a variable selection procedure that highlights the covariates driving the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss whether they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not consistently informative about the existence of interactions. This work is of interest to those who use log-linear models, as well as to practitioners such as epidemiologists who use clustering models to reduce the dimensionality of the data and to reveal interesting patterns in how covariates combine.
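As a small illustration of the link between cluster profiles and marginal independence, the sketch below builds the joint distribution of three binary covariates under a two-cluster mixture with independence within clusters. It is not the authors' model or code: the function names (`mixture_joint`, `log_odds_ratio`), the weights, and the profile probabilities are all hypothetical values chosen for illustration. The third covariate has the same profile in both clusters, so it does not drive the clustering and is marginally independent of the others. The first two covariates differ across clusters and are marginally dependent. Dependence is measured here by the log odds ratio of the 2x2 marginal table, which corresponds to the two-way interaction in a saturated log-linear model for that table.

```python
import numpy as np

def mixture_joint(weights, profiles):
    """Joint pmf of P binary covariates under a K-cluster mixture.

    weights:  length-K cluster probabilities (hypothetical values).
    profiles: K x P array, profiles[k, p] = P(x_p = 1 | cluster k).
    Covariates are assumed independent within each cluster.
    """
    weights = np.asarray(weights, dtype=float)
    profiles = np.asarray(profiles, dtype=float)
    n_covariates = profiles.shape[1]
    joint = np.zeros((2,) * n_covariates)
    for cell in np.ndindex(*joint.shape):
        x = np.array(cell)
        # Probability of this cell within each cluster (product over covariates).
        per_cluster = np.prod(np.where(x == 1, profiles, 1.0 - profiles), axis=1)
        joint[cell] = weights @ per_cluster
    return joint

def log_odds_ratio(table):
    """Log odds ratio of a 2x2 probability table."""
    return float(np.log(table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0])))

# Illustrative assumption: covariates 1 and 2 differ across clusters,
# covariate 3 has an identical profile in both clusters.
weights = [0.5, 0.5]
profiles = [[0.9, 0.8, 0.3],
            [0.2, 0.1, 0.3]]

joint = mixture_joint(weights, profiles)
lor_12 = log_odds_ratio(joint.sum(axis=2))  # marginal table of covariates 1 and 2
lor_13 = log_odds_ratio(joint.sum(axis=1))  # marginal table of covariates 1 and 3

print(f"log odds ratio (1,2): {lor_12:.3f}")
print(f"log odds ratio (1,3): {lor_13:.3f}")
```

The sketch only illustrates the marginal-independence direction. It says nothing about which higher-order interactions are present, which is the part the abstract reports as not recoverable from the clustering structure in a consistent manner.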
