Two-phase and family-based designs for next-generation sequencing studies

The cost of next-generation sequencing is now approaching that of early GWAS panels, but it remains out of reach for large epidemiologic studies, and the millions of rare variants expected pose challenges for distinguishing causal from non-causal variants. We review two types of designs for sequencing studies: two-phase designs for targeted follow-up of genomewide association studies using unrelated individuals, and family-based designs that exploit co-segregation to prioritize variants and genes. Two-phase designs subsample subjects for sequencing from a larger case-control study jointly on the basis of their disease and carrier status; the discovered variants are then tested for association in the parent study. The analysis combines the full sequence data from the substudy with the more limited SNP data from the main study. We discuss various methods for selecting the subset of variants to test and describe the expected yield of true positive associations in the context of an ongoing study of second breast cancers following radiotherapy. Although the sharing of variants within families makes family-based designs less efficient for discovery than sequencing unrelated individuals, the ability to exploit co-segregation of variants with disease within families helps distinguish causal from non-causal ones. Furthermore, enriching for family history can improve the yield of causal variants, and identity-by-descent information improves imputation of genotypes for other family members. We compare the relative efficiency of these designs with those using unrelated individuals for discovering and prioritizing variants or genes for testing association in larger studies. Although single variants can be tested for association, power is low for rare ones. Recent generalizations of burden or kernel tests for gene-level associations to family-based data are appealing. These approaches are illustrated in the context of a family-based study of colorectal cancer.
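The phase-two subsampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the function name `two_phase_sample`, the equal allocation of `n_per_stratum` subjects to each stratum, and the toy cohort are all hypothetical. Subjects are stratified jointly by disease status and carrier status at a GWAS-associated SNP, and each sequenced subject receives an inverse-sampling-fraction weight of the kind used in weighted analyses of two-phase data.

```python
import random
from collections import defaultdict

def two_phase_sample(subjects, n_per_stratum, seed=0):
    """Select a phase-two subsample stratified on (case, carrier).

    subjects: iterable of (subject_id, case, carrier) tuples.
    Returns {subject_id: weight}, where weight is the inverse of the
    sampling fraction within that subject's stratum.
    Equal allocation per stratum is an assumption made for illustration.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sid, case, carrier in subjects:
        strata[(case, carrier)].append(sid)

    weights = {}
    for members in strata.values():
        k = min(n_per_stratum, len(members))
        for sid in rng.sample(members, k):
            weights[sid] = len(members) / k
    return weights

# Hypothetical parent study: 1000 subjects, half cases, 20% carriers.
cohort = [(i, i % 2, 1 if i % 10 < 2 else 0) for i in range(1000)]
weights = two_phase_sample(cohort, n_per_stratum=50)
```

With this allocation the rarer carrier strata are sampled at a higher fraction than the non-carrier strata, which is the point of selecting jointly on disease and carrier status; the weights sum back to the size of the parent study.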
