A Novel Alignment-Free Method for Comparing Transcription Factor Binding Site Motifs

Background: Transcription factor binding site (TFBS) motifs can be accurately represented by position frequency matrices (PFMs) or equivalent forms. We often need to compare TFBS motifs using their PFMs, either to search a motif database for similar motifs or to cluster motifs according to their binding preference. Most current methods for motif comparison combine a similarity metric for column-to-column comparison with a procedure for finding the optimal positional alignment between the two motifs. In some applications alignment-free methods may be preferable; however, few such methods with high accuracy have been described.

Methodology/Principal Findings: Here we describe a novel alignment-free method for quantifying the similarity of motifs from their PFMs by converting the PFMs into k-mer vectors. The motifs can then be compared by measuring the similarity between their corresponding k-mer vectors.

Conclusions/Significance: We demonstrate that our method generally performs as well as or better than existing methods at clustering motifs according to their binding preference and at identifying similar motifs of transcription factors from the same family.
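The idea of turning a PFM into a k-mer vector and comparing vectors can be illustrated with a minimal sketch. The abstract does not specify how the k-mer vector is built or which similarity measure is used, so everything below is an assumption made for illustration: each k-mer is scored by summing, over every window of k consecutive PFM columns, the product of its base probabilities, and two vectors are compared by cosine similarity. The function names `normalize_pfm`, `kmer_vector`, and `cosine_similarity` are hypothetical, and reverse-complement handling is omitted.

```python
from itertools import product
from math import sqrt

# Column order assumed for every PFM row: A, C, G, T.
BASES = "ACGT"


def normalize_pfm(pfm):
    """Convert per-position count rows [A, C, G, T] into probability rows."""
    rows = []
    for counts in pfm:
        total = sum(counts)
        if total <= 0:
            raise ValueError("each PFM position needs a positive total count")
        rows.append([c / total for c in counts])
    return rows


def kmer_vector(pfm, k=2):
    """Assumed scheme (not taken from the paper): for each of the 4**k k-mers,
    sum over all windows of k columns the product of its base probabilities."""
    probs = normalize_pfm(pfm)
    if len(probs) < k:
        raise ValueError("motif is shorter than k")
    vector = []
    for kmer in product(range(len(BASES)), repeat=k):
        score = 0.0
        for start in range(len(probs) - k + 1):
            p = 1.0
            for offset, base in enumerate(kmer):
                p *= probs[start + offset][base]
            score += p
        vector.append(score)
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy motifs: ACGT, the same core shifted by one position (TACG), and TTTT.
motif_a = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
motif_b = [[0, 0, 0, 10], [10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]
motif_c = [[0, 0, 0, 10]] * 4

sim_ab = cosine_similarity(kmer_vector(motif_a), kmer_vector(motif_b))
sim_ac = cosine_similarity(kmer_vector(motif_a), kmer_vector(motif_c))
print(f"shifted motif: {sim_ab:.3f}, unrelated motif: {sim_ac:.3f}")
```

In this toy case the shifted motif still shares two of its three 2-mers with the original, so it scores well above the unrelated motif without any positional alignment being computed. A practical version would probably also need pseudocounts, a choice of k suited to motif length, and treatment of the reverse strand; none of those choices are stated in the abstract.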
