ShapeGTB: the role of local DNA shape in prioritization of functional variants in human promoters with machine learning

Motivation: The identification of functional sequence variations in regulatory DNA regions is one of the major challenges of modern genetics. Here, we report the results of a combined multifactor analysis of properties characterizing functional sequence variants located in the promoter regions of genes.

Results: We demonstrate that the GC-content of local sequence fragments and local DNA shape features play a significant role in the prioritization of functional variants, and that they outscore features related to histone modifications, transcription factor binding sites, and evolutionary conservation descriptors. These observations allowed us to build ShapeGTB, a specialized machine learning classifier that identifies functional single nucleotide polymorphisms within promoter regions. We compared our method with more general tools that predict the pathogenicity of all non-coding variants. ShapeGTB outperformed them by a wide margin (average precision 0.93 vs. 0.47–0.55). On an external validation set based on the ClinVar database it performed worse, but remained competitive with the other methods (average precision 0.47 vs. 0.23–0.42). These results suggest that mutations located within promoter regions have unique characteristics, and they are a promising signal for the development of more accurate variant prioritization tools in the future.
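As an illustration only, the two quantities named in the abstract can be sketched in a few lines of Python: the GC-content of a local window around a variant, and the average precision used to compare classifiers. The function names, the 0-based position convention, and the `flank` window size are assumptions made for this sketch; the abstract does not state the window size or the implementation used for ShapeGTB.

```python
def local_gc_content(sequence, position, flank=10):
    """Fraction of G/C bases in a window centred on a 0-based position.

    `flank` (bases kept on each side) is a hypothetical default,
    not a value taken from the paper.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    # Clip the window at the sequence boundaries.
    start = max(0, position - flank)
    end = min(len(sequence), position + flank + 1)
    window = sequence[start:end].upper()
    gc_count = sum(base in "GC" for base in window)
    return gc_count / len(window)


def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive occurs.

    `labels` are 1 (functional) or 0 (neutral); `scores` are classifier
    outputs, higher meaning more likely functional. Tied scores keep
    their input order in this simple sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("no positive labels")
    return precision_sum / hits
```

For example, `local_gc_content("ATGCGCAT", 3, flank=2)` scores the window `TGCGC`, and `average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])` averages the precision at ranks 1 and 3.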
