Journal of the American Statistical Association

Nonparametric Estimation for Censored Mixture Data with Application to the Cooperative Huntington's Observational Research Trial

This work presents methods for estimating genotype-specific outcome distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs; Type I and Type II) that do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators that do not assume parametric density models and are easy to implement. They are based on inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close agreement of the estimated noncarrier survival rates with those of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared with that in noncarriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and in helping future subjects at risk to make informed decisions on whether to undergo genetic mutation testing. Technical details and additional numerical results are provided in the online supplementary materials.
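
To make the IPW idea concrete, the following is a minimal sketch of one plausible inverse-probability-weighted estimator for censored mixture data. It is not taken from the source, and the estimators in the paper may differ in their details. It assumes that each subject i has a known vector of mixture proportions q_i (for example, genotype probabilities derived from a relative's test result), that censoring is independent of the event time and of the genotype, and that observed times have no ties. The function names `km_censoring_left_limit` and `ipw_mixture_cdf` and all argument names are hypothetical.

```python
import numpy as np


def km_censoring_left_limit(x, delta):
    """Kaplan-Meier estimate of the censoring survival function G(t-),
    evaluated at each observed time x_i (censorings are the 'events')."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    cens_times = np.unique(x[delta == 0])
    # Number still under observation at each censoring time.
    at_risk = np.array([(x >= u).sum() for u in cens_times])
    # Number censored exactly at each censoring time.
    n_cens = np.array([((x == u) & (delta == 0)).sum() for u in cens_times])
    factors = 1.0 - n_cens / at_risk
    # Product over censoring times strictly before x_i (empty product is 1).
    return np.array([factors[cens_times < xi].prod() for xi in x])


def ipw_mixture_cdf(x, delta, q, t):
    """Assumed IPW sketch: estimate the p genotype-specific CDFs at time t.

    x     : observed times, min(event time, censoring time)
    delta : 1 if the event was observed, 0 if censored
    q     : (n, p) array of assumed-known mixture proportions per subject
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    q = np.asarray(q, dtype=float)
    g = km_censoring_left_limit(x, delta)
    # Uncensored subjects are up-weighted by 1/G(x_i-); censored ones get 0.
    w = np.where(delta == 1, 1.0 / g, 0.0)
    y = w * (x <= t)
    # Least-squares step: solve (sum q_i q_i^T) F = sum q_i y_i.
    return np.linalg.solve(q.T @ q, q.T @ y)
```

As a sanity check on the design, when there is no censoring and each q_i is a one-hot indicator (genotypes fully observed), the least-squares step reduces to the empirical distribution function within each genotype group.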
