Variable selection for random effects two-part models

Random effects two-part models have been applied in longitudinal studies to zero-inflated (or semi-continuous) data, which are characterized by a large proportion of zero values together with continuous positive values. Examples include monthly medical costs, daily number of alcoholic drinks, and microbiome relative abundance. With advances in information technology for data collection and storage, the number of variables available to researchers in such studies can be large. To avoid the curse of dimensionality and facilitate decision making, it is critically important to select the covariates that are truly related to the outcome. However, owing to the intricate nature of these models, no satisfactory variable selection method is yet available for them. In this paper, we seek a feasible way of conducting variable selection for random effects two-part models on the basis of the recently proposed “minimum information criterion” (MIC) method. We demonstrate that the MIC approach leads to a reasonable formulation of sparse estimation, which can be conveniently solved with SAS PROC NLMIXED. The performance of our approach is evaluated through simulation, and an application to a longitudinal alcohol dependence study is provided.
