On Model Selection Algorithms in Multi-dimensional Contingency Tables

We present a review focussed on model selection in log-linear models and contingency tables. Sparsity and high dimensionality have become increasingly important, for example in the analysis of high-throughput genetic data. In particular, we describe recently developed automatic search algorithms, implemented in R, for finding optimal hierarchical log-linear models (HLLMs) in sparse multi-dimensional contingency tables, together with some LASSO-type penalized likelihood approaches to model selection. The methods rely, in part, on a new result that identifies models whose maximum likelihood estimators do not exist and thus permits their rapid elimination in high-dimensional tables.
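To illustrate why sparsity complicates the search, the following minimal sketch checks a classical sufficient condition for non-existence of the maximum likelihood estimator: if any marginal total fitted by an HLLM's generating class is zero, the log-linear parameters have no finite estimate. This is not the new result referred to above, only a simple screening rule; the table, the function names and the generating classes are hypothetical and chosen for illustration.

```python
from itertools import product

def margin_totals(counts, shape, subset):
    """Sum a table (dict: cell tuple -> count) over all variables not in `subset`."""
    totals = {key: 0 for key in product(*(range(shape[v]) for v in subset))}
    for cell, n in counts.items():
        totals[tuple(cell[v] for v in subset)] += n
    return totals

def has_zero_margin(counts, shape, generators):
    """True if any sufficient margin of the generating class contains a zero."""
    return any(
        total == 0
        for subset in generators
        for total in margin_totals(counts, shape, subset).values()
    )

# Hypothetical sparse 2x2x2 table for binary variables 0, 1 and 2.
shape = (2, 2, 2)
counts = {cell: 0 for cell in product(range(2), repeat=3)}
counts.update({(0, 0, 0): 5, (0, 1, 1): 3, (1, 0, 0): 2, (1, 0, 1): 4})

# Generating classes (assumed for illustration): main effects only,
# and all two-way interactions.
main_effects = [(0,), (1,), (2,)]
two_way = [(0, 1), (0, 2), (1, 2)]

print(has_zero_margin(counts, shape, main_effects))
print(has_zero_margin(counts, shape, two_way))
```

A positive margin check is necessary but not sufficient for existence, so a search procedure could use a test of this kind only to discard candidate models early, not to confirm that a model can be fitted.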
