A non-independent energy-based multiple sequence alignment improves prediction of transcription factor binding sites

MOTIVATION Multiple sequence alignments (MSAs) are usually scored under the assumption that the sequences being aligned have evolved by common descent, so that the differences between sequences reflect insertions, deletions and mutations. However, non-coding DNA binding sequences, such as transcription factor binding sites (TFBSs), are frequently not related by common descent, and existing alignment scoring methods are therefore not well suited to aligning them.

RESULTS We present a novel MSA methodology that scores TFBS DNA sequences by including the interdependence of neighboring bases. We introduce two variants that rest on different underlying null hypotheses, one statistically generated and the other thermodynamically generated. We assessed the alignments through their performance in TFBS prediction; both methods show considerable improvements when compared with standard MSA algorithms. Moreover, the thermodynamically generated null hypothesis outperforms the statistical one due to improved stability in the base stacking free energy of the alignment. The thermodynamically generated null hypothesis method can be downloaded from http://sourceforge.net/projects/msa-edna/.

CONTACT dov.stekel@nottingham.ac.uk.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
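To make the idea of scoring with neighboring-base interdependence concrete, the following is a minimal sketch, not the scoring function used in this work. It scores an ungapped alignment by summing log-odds of adjacent base pairs (dinucleotides) against a null distribution. The function name `dinucleotide_score`, the uniform default null, and the pseudocount are all assumptions made for illustration; a thermodynamically generated null would instead supply a `background` derived from base-stacking free energies, which are not given here.

```python
import math
from collections import Counter

BASES = "ACGT"
PAIRS = [a + b for a in BASES for b in BASES]


def dinucleotide_score(alignment, background=None, pseudocount=1.0):
    """Sum of log2-odds over adjacent column pairs of an ungapped alignment.

    alignment  : list of equal-length strings over the alphabet ACGT (no gaps)
    background : dict mapping each dinucleotide to its null probability;
                 defaults to uniform (1/16), which is an illustrative assumption
    pseudocount: added to every dinucleotide count to avoid zero frequencies
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    if background is None:
        background = {pair: 1.0 / len(PAIRS) for pair in PAIRS}

    n_rows = len(alignment)
    width = len(alignment[0])
    denom = n_rows + pseudocount * len(PAIRS)
    total = 0.0
    for j in range(width - 1):
        # Count the dinucleotide each sequence shows at columns (j, j+1).
        counts = Counter(row[j:j + 2] for row in alignment)
        for pair, count in counts.items():
            observed = (count + pseudocount) / denom
            total += count * math.log2(observed / background[pair])
    return total


conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]
shuffled = ["ACGT", "TGCA", "GATC", "CTAG"]
print(dinucleotide_score(conserved) > dinucleotide_score(shuffled))
```

Under this sketch, an alignment whose rows share the same adjacent base pairs scores higher than one whose rows disagree, which is the property an alignment procedure would try to maximize. Passing a different `background` dictionary is the only change needed to compare alternative null hypotheses.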
