Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling

Two-phase stratified sampling designs can reduce the cost of genetic epidemiologic studies by limiting expensive ascertainment of genetic and environmental exposures to an efficiently selected subsample (phase II) of the main study (phase I). Family history and some covariate information, which may be gathered cheaply for all subjects at phase I, can be used to sample informative subjects at phase II. We develop alternative maximum likelihood methods for the analysis of data from such studies by using a novel regression model that permits the estimation of 'marginal' risk parameters associated with the genetic and environmental covariates of interest, while simultaneously characterizing the 'conditional' risk of the disease associated with family history after adjusting for the other covariates. The methods and the corresponding asymptotic theory are developed with and without an assumption of gene-environment independence, allowing the distribution of the environmental factors to remain non-parametric. The performance of the alternative methods and of sampling strategies is studied by using simulated data involving rare and common genetic variants. The proposed methods are illustrated with a case-control study of colorectal adenoma embedded within the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Copyright Royal Statistical Society.
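To make the two-phase design concrete, the following is a minimal simulation sketch and not the maximum likelihood method developed in the paper. It assumes a hypothetical cohort in which disease status `d` and family history `fh` are known for everyone at phase I, while carrier status `g` is observed only for subjects selected at phase II within strata defined by (`d`, `fh`). All prevalences, risk values, and sampling fractions are invented for illustration, and the final estimator is plain inverse-probability (Horvitz-Thompson style) weighting, used here only as a sanity check on the sampling weights.

```python
import random
from collections import defaultdict

# Hypothetical phase II sampling fractions per (disease, family history) stratum.
FRACTIONS = {(1, 1): 1.0, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 0.05}

def simulate_phase_one(n, seed=0):
    """Simulate a cohort; all probabilities below are illustrative assumptions."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        g = 1 if rng.random() < 0.10 else 0                     # expensive genotype
        fh = 1 if rng.random() < (0.30 if g else 0.10) else 0   # cheap family history
        d = 1 if rng.random() < 0.05 + 0.10 * g + 0.05 * fh else 0
        cohort.append({"d": d, "fh": fh, "g": g})
    return cohort

def select_phase_two(cohort, fractions, seed=1):
    """Sample within (d, fh) strata; return {subject index: sampling weight}."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, subject in enumerate(cohort):
        strata[(subject["d"], subject["fh"])].append(i)
    weights = {}
    for key, members in strata.items():
        k = min(len(members), max(1, round(fractions[key] * len(members))))
        for i in rng.sample(members, k):
            weights[i] = len(members) / k   # inverse of the realized sampling fraction
    return weights

def ipw_carrier_frequency(cohort, weights):
    """Weighted carrier frequency using genotypes of phase II subjects only."""
    total = sum(weights.values())
    return sum(w * cohort[i]["g"] for i, w in weights.items()) / total

if __name__ == "__main__":
    cohort = simulate_phase_one(20000)
    weights = select_phase_two(cohort, FRACTIONS)
    print("phase II size:", len(weights), "of", len(cohort))
    print("weighted carrier frequency:", ipw_carrier_frequency(cohort, weights))
```

Oversampling cases and subjects with a positive family history concentrates genotyping on the most informative subjects, and the weights then correct for the unequal selection probabilities. The likelihood-based methods described in the abstract use the same phase I information but are expected to be more efficient than simple weighting.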
