Exploiting Linkage Disequilibrium for Ultrahigh-Dimensional Genome-Wide Data with an Integrated Statistical Approach

Genome-wide data sets with millions of single-nucleotide polymorphisms (SNPs) are often highly correlated because of linkage disequilibrium (LD). This ultrahigh dimensionality poses serious challenges for statistical modeling: noise accumulation, the curse of dimensionality, computational burden, spurious correlations, and processing and storage bottlenecks. Traditional statistical approaches lose power when p ≫ n (where n is the number of observations and p is the number of SNPs) and when the correlation structure among SNPs is complex. In this article, we propose an integrated distance correlation ridge regression (DCRR) approach that accommodates the ultrahigh dimensionality, the joint polygenic effects of multiple loci, and the complex LD structure. First, a distance correlation (DC) screening step removes most of the noise SNPs; the LD structure among the retained SNPs is then addressed with a ridge-penalized multiple logistic regression (LRR) model. The false discovery rate, true positive discovery rate, and computational cost were assessed simultaneously through a large number of simulations. A binary trait of Arabidopsis thaliana, the hypersensitive response to the bacterial elicitor AvrRpm1, was analyzed in 84 inbred lines (28 susceptible and 56 resistant) with 216,130 SNPs. Compared with previous SNP discovery methods applied to the same data set, the DCRR approach detected the causative SNP while substantially reducing both spurious associations and computation time.
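The two-stage idea described above (DC screening followed by a ridge-penalized logistic fit) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `distance_correlation`, `dc_screen`, `ridge_logistic`, and `demo`, the parameters `keep` and `lam`, and the simulated genotypes are all assumptions made for illustration. The sketch uses the standard sample distance correlation of Székely and Rizzo and a plain Newton-Raphson solver, and it stops at the fitted coefficients rather than reproducing any significance testing.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each sample.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix (subtract row and column means, add grand mean).
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom <= 0.0:
        return 0.0  # a constant sample carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def dc_screen(X, y, keep):
    """Return the column indices of the `keep` SNPs with the largest DC to the trait."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1][:keep]
    return [int(j) for j in order]

def ridge_logistic(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Ridge-penalized logistic regression via Newton-Raphson (intercept unpenalized)."""
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X])
    P = lam * np.eye(d + 1)
    P[0, 0] = 0.0  # do not shrink the intercept
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        eta = np.clip(Z @ beta, -30.0, 30.0)  # guard against overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        grad = Z.T @ (y - mu) - P @ beta
        hess = Z.T @ (Z * w[:, None]) + P
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def demo(seed=0, n=100, p=300, keep=20, lam=1.0):
    """Simulated example (hypothetical data): SNP 0 is causal, SNP 1 is in LD with it."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
    redraw = rng.random(n) < 0.1
    X[:, 1] = np.where(redraw, rng.integers(0, 3, size=n), X[:, 0])
    prob = 1.0 / (1.0 + np.exp(-3.0 * (X[:, 0] - 1.0)))
    y = (rng.random(n) < prob).astype(float)
    kept = dc_screen(X, y, keep)            # stage 1: screening
    Xk = X[:, kept] - X[:, kept].mean(axis=0)
    beta = ridge_logistic(Xk, y, lam)       # stage 2: joint ridge fit
    return kept, beta

if __name__ == "__main__":
    kept, beta = demo()
    print("retained SNP indices:", kept)
    print("ridge coefficients (intercept first):", np.round(beta, 3))
```

In this sketch the screening step reduces p = 300 simulated SNPs to 20 before the joint fit, and the ridge penalty keeps the Newton system well conditioned even though SNPs 0 and 1 are nearly collinear. In practice `keep` and `lam` would need to be chosen by some data-driven rule, which is not shown here.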
