Natural similarity measures between position frequency matrices with an application to clustering

MOTIVATION: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires comparison with known PFMs to avoid redundancy. In general, two PFMs are similar if their hits occur at overlapping positions under a null model. Still, most existing methods compute similarity from probabilistic distances between the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the numbers of PFM hits, incorporating both strands. Furthermore, we introduce a second measure based on the same idea and use it to cluster a set of the Jaspar PFMs.

RESULTS: We show that the asymptotic covariance can be computed efficiently by a two-dimensional convolution of the score distributions. The asymptotic covariance approach correlates strongly with simulated data and outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class, and a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons.

AVAILABILITY: A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.de.
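The idea of a two-dimensional convolution of score distributions can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions that are not in the abstract: both matrices have the same length, scores are integers, the null model is an i.i.d. uniform background, and only the zero-shift, single-strand overlap is considered. The function names, the `BACKGROUND` table and the toy matrices are hypothetical.

```python
from collections import defaultdict

# Assumed null model: i.i.d. uniform background (an illustration, not from the paper).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def joint_score_distribution(scores_a, scores_b, background=BACKGROUND):
    """Joint distribution of the two matrix scores on one random window.

    scores_a, scores_b: lists of dicts mapping base -> integer score,
    one dict per matrix column; both lists must have the same length.
    Each column is folded in by a two-dimensional convolution step.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("this sketch only handles matrices of equal length")
    dist = {(0, 0): 1.0}
    for col_a, col_b in zip(scores_a, scores_b):
        nxt = defaultdict(float)
        for (sa, sb), p in dist.items():
            for base, q in background.items():
                nxt[(sa + col_a[base], sb + col_b[base])] += p * q
        dist = dict(nxt)
    return dist


def hit_covariance(scores_a, scores_b, thr_a, thr_b, background=BACKGROUND):
    """Covariance of the two hit indicators at the same position.

    A hit means the score reaches the matrix's threshold. Only the
    zero-shift term is computed here; shifts and the reverse strand
    are deliberately left out of this sketch.
    """
    joint = joint_score_distribution(scores_a, scores_b, background)
    p_a = sum(p for (sa, _), p in joint.items() if sa >= thr_a)
    p_b = sum(p for (_, sb), p in joint.items() if sb >= thr_b)
    p_ab = sum(p for (sa, sb), p in joint.items() if sa >= thr_a and sb >= thr_b)
    return p_ab - p_a * p_b


# Toy two-column score matrices (hypothetical): one rewards A, the other C.
prefers_a = [{"A": 1, "C": 0, "G": 0, "T": 0}] * 2
prefers_c = [{"A": 0, "C": 1, "G": 0, "T": 0}] * 2

same = hit_covariance(prefers_a, prefers_a, 2, 2)       # identical matrices
different = hit_covariance(prefers_a, prefers_c, 2, 2)  # mutually exclusive hits
```

In this toy setting, identical matrices give a positive covariance because their hits always coincide, while matrices whose hits exclude each other give a slightly negative one.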
