Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints

In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches for assessing the regulatory potential of SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Sasquatch thus takes an unbiased approach, independent of known transcription factor binding sites and motifs. It requires only a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequencing depths. We demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome.
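The k-mer comparison described above can be illustrated with a minimal sketch. This is not the Sasquatch implementation: the function names, the per-k-mer score table, and the simple summed difference are all assumptions made for illustration. The sketch assumes that each k-mer already has a numeric footprint-strength score for one cell type, and it compares the k-mers that cover a variant position in the reference and alternate sequences.

```python
from typing import Dict, List


def kmers_overlapping(seq: str, pos: int, k: int) -> List[str]:
    """Return every k-mer in seq that covers the 0-based index pos."""
    start_min = max(0, pos - k + 1)
    start_max = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(start_min, start_max + 1)]


def variant_damage(seq: str, pos: int, alt: str, k: int,
                   scores: Dict[str, float]) -> float:
    """Sum (reference score - alternate score) over k-mers covering pos.

    `scores` is a hypothetical lookup of footprint strength per k-mer;
    k-mers missing from the table are treated as having no footprint (0.0).
    A positive result means the variant weakens the predicted footprint.
    """
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_kmers = kmers_overlapping(seq, pos, k)
    alt_kmers = kmers_overlapping(alt_seq, pos, k)
    return sum(scores.get(r, 0.0) - scores.get(a, 0.0)
               for r, a in zip(ref_kmers, alt_kmers))


# Toy scores invented for illustration only; they are not measured data.
toy_scores = {"GATA": 2.0, "ATAA": 1.0}

# Substitute T -> C at index 4 of a short sequence containing a GATA motif.
damage = variant_damage("TTGATAAGG", 4, "C", 4, toy_scores)
print(damage)
```

In this toy case the substitution removes both scored k-mers, so the summed difference is positive. A real analysis would derive the scores from DNase-seq cut profiles rather than from a hand-written table.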
