Analysis of Large Genomic Data in Silico: The EPIC-Norfolk Study of Obesity

In human genetics, large-scale data have become available through advances in genotyping technologies and international collaborative projects. Our ongoing study of obesity involves Affymetrix 500K GeneChips on approximately 7000 individuals from the European Prospective Investigation into Cancer (EPIC) Norfolk study. Although the scale of our data is well beyond the capacity of many software systems, we have successfully performed the analysis using the Statistical Analysis System (SAS) software. Our implementation trades off memory against computing time and requires only a moderate hardware configuration. By using such an established system, our work extends some earlier discussions in a more constructive and accessible way. We report our findings and give some recommendations for using SAS. We also briefly compare our approach with alternative implementations. Our work is relevant to researchers analysing large-scale data in general, and genomewide association studies in particular.
