Latent class models for joint analysis of disease prevalence and high-dimensional semicontinuous biomarker data.

High-dimensional biomarker data are often collected in epidemiological studies when the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the two modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and a complex mean-variance relationship in the biomarker levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and the model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.
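
To make the structure described above concrete, the following is a minimal simulation sketch of one possible data-generating process with the same ingredients: a latent risk class that drives both a binary disease outcome and semicontinuous biomarkers, with biomarker-specific random effects for the baseline level and the between-class difference. It is not the model specification from the paper. The function name `simulate_joint_data`, the two-class setup, the logistic zero-probability, the log-normal positive part, and all numeric values are illustrative assumptions.

```python
import numpy as np


def simulate_joint_data(n_subjects=500, n_biomarkers=30, seed=0):
    """Simulate toy data from a hypothetical two-class joint model.

    All distributions and parameter values are illustrative assumptions,
    not the specification used in the paper.
    """
    rng = np.random.default_rng(seed)

    # Latent risk class per subject: 0 = low risk, 1 = high risk (assumed prevalence).
    latent_class = rng.binomial(1, 0.3, size=n_subjects)

    # Binary disease outcome whose probability depends only on the latent class.
    p_disease = np.where(latent_class == 1, 0.6, 0.2)
    disease = rng.binomial(1, p_disease)

    # Biomarker-specific random effects: a baseline (low-risk) level and a
    # between-class difference, each drawn once per biomarker.
    baseline = rng.normal(0.0, 0.5, size=n_biomarkers)
    class_shift = rng.normal(0.8, 0.3, size=n_biomarkers)

    # Mean of the log-scale positive part, shape (n_subjects, n_biomarkers).
    mean_log = baseline[None, :] + latent_class[:, None] * class_shift[None, :]

    # Two-part (semicontinuous) structure: a point mass at zero plus a
    # log-normal positive part. Higher mean levels get fewer zeros here,
    # which is one simple way to couple the two parts (an assumption).
    p_zero = 1.0 / (1.0 + np.exp(0.5 + mean_log))
    is_positive = rng.random((n_subjects, n_biomarkers)) > p_zero
    positive_values = np.exp(rng.normal(mean_log, 0.7))
    biomarkers = np.where(is_positive, positive_values, 0.0)

    return latent_class, disease, biomarkers


if __name__ == "__main__":
    latent_class, disease, biomarkers = simulate_joint_data()
    print("fraction of zero biomarker values:", np.mean(biomarkers == 0.0))
    print("disease rate, low-risk class:", disease[latent_class == 0].mean())
    print("disease rate, high-risk class:", disease[latent_class == 1].mean())
```

In this sketch the disease outcome and the biomarkers are conditionally independent given the latent class, so any association between them arises only through class membership. Fitting such a model requires integrating over the unobserved classes and random effects, which is the role the abstract assigns to the MCEM algorithm; that estimation step is not shown here.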
