DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles

Understanding the link between non-coding sequence variants identified in genome-wide association studies and the pathophysiology of complex diseases remains challenging due to a lack of annotations in non-coding regions. To overcome this, we developed DIVAN, a novel feature selection and ensemble learning framework that identifies disease-specific risk variants by leveraging a comprehensive collection of genome-wide epigenomic profiles across cell types and factors, along with other static genomic features. DIVAN accurately and robustly recognizes non-coding disease-specific risk variants across multiple testing scenarios. Among all the features, histone marks, especially those associated with repressed chromatin, are often more informative than others.
