ConBind: motif-aware cross-species alignment for the identification of functional transcription factor binding sites

Eukaryotic gene expression is regulated by transcription factors (TFs) binding to promoters as well as distal enhancers. TFs recognize short but specific binding sites (TFBSs) located within these promoter and enhancer regions. Functionally relevant TFBSs are often highly conserved during evolution, leaving a strong phylogenetic signal. While multiple sequence alignment (MSA) is a potent tool for detecting this signal, current MSA implementations are optimized to maximize the number of aligned identical nucleotides. This approach can miss conserved motifs that contain interchangeable nucleotides, such as the ETS motif (IUPAC code: GGAW). Here, we introduce ConBind, a novel method that enhances the alignment of short motifs even if their mutual sequence similarity is only partial. ConBind improves the identification of conserved TFBSs by increasing the alignment accuracy of TFBS families within orthologous DNA sequences. Functional validation of the Gfi1b +13 enhancer reveals that ConBind identifies additional functionally important ETS binding sites that were missed by all other tested alignment tools. In addition to the analysis of known regulatory regions, our web tool is useful for the analysis of TFBSs in previously uncharacterized DNA regions identified through ChIP-sequencing.
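
The point about interchangeable nucleotides can be illustrated with a minimal Python sketch. It is not ConBind's algorithm: it only shows that two sites can differ at one nucleotide and still both match the degenerate ETS motif GGAW (W = A or T). The function names `iupac_to_regex` and `find_sites` and the two toy sequences are hypothetical, and only the forward strand is scanned for simplicity.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GGAW' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_sites(sequence, motif):
    """Return (start, matched_text) for each forward-strand match.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical toy sequences standing in for two orthologous regions.
human = "TTCAGGAAGTCC"
mouse = "TTCAGGATGTCC"

human_sites = find_sites(human, "GGAW")
mouse_sites = find_sites(mouse, "GGAW")

# Both sequences carry a GGAW site at the same position,
# although the matched nucleotides are not identical.
print(human_sites)
print(mouse_sites)
```

An aligner that rewards only identical nucleotides scores the fourth position of these two sites as a mismatch, whereas a motif-aware comparison treats both as instances of the same binding site family.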
