Decomposition and Model Selection for Large Contingency Tables

Large contingency tables summarizing categorical variables arise in many areas. One example is biology, where large numbers of biomarkers are cross-tabulated according to their discrete expression levels. Interactions among the variables are of great interest and are generally studied with log-linear models. The structure of a log-linear model can be represented by a graph from which the conditional independence structure can be read off directly. However, because the number of parameters in a saturated model grows exponentially in the number of variables, fitting such models generally carries a heavy computational burden. Even if we restrict ourselves to models with lower-order interactions or other sparse structures, we still face a large number of cells, which plays the role of the sample size. This is in sharp contrast to high-dimensional regression or classification procedures: in addition to a high-dimensional parameter, we also have to deal with the analogue of a huge sample size. Furthermore, high-dimensional tables naturally contain many sampling zeros, which often cause the maximum likelihood estimate not to exist. We therefore present a decomposition approach in which we first divide the problem into several lower-dimensional problems and then combine their solutions to form a global solution. Our methodology is computationally feasible for log-linear interaction models with many categorical variables, some or all of which have many levels. We demonstrate the proposed method on simulated data and apply it to a biomedical problem in cancer research.
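To make the scaling argument concrete, the following minimal sketch counts the cells of a table and the parameters of a log-linear model restricted to main effects and pairwise interactions. The function names and the choice of ten three-level variables are illustrative assumptions and are not taken from the paper; the counts use the standard corner-point parameterization, under which a saturated model has as many parameters as the table has cells.

```python
from itertools import combinations
from math import prod


def table_cells(levels):
    """Number of cells in the full table (also the saturated-model parameter count)."""
    return prod(levels)


def pairwise_model_params(levels):
    """Parameters of a log-linear model with intercept, main effects and
    all two-way interactions (illustrative count, not the paper's method)."""
    main = sum(k - 1 for k in levels)
    pairwise = sum((a - 1) * (b - 1) for a, b in combinations(levels, 2))
    return 1 + main + pairwise


# Hypothetical example: ten variables with three levels each.
levels = [3] * 10
print(table_cells(levels))            # 59049
print(pairwise_model_params(levels))  # 201
```

Even though the restricted model has only a few hundred parameters here, the table it must be fitted to still has tens of thousands of cells, which is the "huge sample size" aspect described above.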
