MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences

Background
Computational approaches for finding DNA regulatory motifs in promoter sequences help biologists reduce experimental costs and speed up the de novo discovery of binding sites. Rule-based and clustering-based motif searching schemes need to evaluate the similarity between a k-mer (a subsequence of length k) and a motif model both effectively and efficiently, without assuming the independence of nucleotides in motif models and without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. It is also beneficial to use a priori knowledge when developing advanced searching tools.

Results
This paper presents a new scoring function, termed MISCORE, for functional motif characterization and evaluation. MISCORE is free from: (i) any assumption on model dependency; and (ii) the use of a Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations against the well-known Maximum a Posteriori (MAP) score and Information Content (IC) show that MISCORE has promising capabilities to separate and recognize functional DNA motifs and their instances from non-functional ones.

Conclusions
MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It allows a priori known motif models to be embedded for computing motif-to-motif similarity, which is an advantage over IC and the MAP score. In addition, because compositional complexity is built into the function, MISCORE can automatically filter out some repetitive k-mers from a motif model. These merits, in both motif signal modeling power and computational efficiency, make MISCORE more applicable in the development of computational motif discovery tools.
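For readers unfamiliar with the IC baseline named above, the following is a minimal sketch of the standard information content of a set of aligned k-mers, computed as the sum over positions of f(b, i) * log2(f(b, i) / p(b)). It is not MISCORE, whose definition is not given in this abstract. The function name `information_content`, the uniform default background, and the example k-mers are assumptions made for illustration only.

```python
import math
from collections import Counter

BASES = "ACGT"


def information_content(kmers, background=None):
    """Information content (in bits) of equal-length aligned k-mers.

    Hypothetical helper for illustration: uses the standard relative-entropy
    form of IC against a per-base background (uniform by default).
    """
    if not kmers:
        raise ValueError("at least one k-mer is required")
    k = len(kmers[0])
    if any(len(s) != k for s in kmers):
        raise ValueError("all k-mers must have the same length")
    if background is None:
        # Assumption: uniform background; real tools estimate this from data.
        background = {b: 0.25 for b in BASES}

    total = 0.0
    for i in range(k):
        counts = Counter(s[i] for s in kmers)
        for b in BASES:
            f = counts[b] / len(kmers)  # observed frequency of base b at position i
            if f > 0:  # treat 0 * log(0) as 0
                total += f * math.log2(f / background[b])
    return total


# A perfectly conserved motif scores 2 bits per position under a uniform background.
conserved = information_content(["ACGT", "ACGT", "ACGT"])
# A low-complexity, repetitive motif receives exactly the same score.
repetitive = information_content(["AAAA", "AAAA", "AAAA"])
# Fully mixed columns carry no information.
mixed = information_content(["A", "C", "G", "T"])
```

Under a uniform background, IC gives the repetitive set `AAAA` the same score as the conserved set `ACGT`, which illustrates the kind of case where taking compositional complexity into account, as the abstract describes, could make a difference.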
