Evaluating deterministic motif significance measures in protein databases

Background
Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, thereby reducing the analysis work. Spotting the most interesting and relevant motifs therefore depends on choosing the right measures. The combined use of several measures may provide more robust results. However, caution is needed to avoid spurious evaluations.

Results
The experiments showed that several of the selected significance measures behave very similarly across a wide range of situations and therefore provide redundant information. Some measures proved more appropriate for ranking highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears to be a very important feature to consider for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, for instance when positive data is significantly scarcer than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables easy identification of high-scoring motifs.

Conclusion
In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. The measures were applied under different testing conditions, and relations were identified. This study provides some pertinent insights into the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.
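The remark about poorly balanced class information can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sequences, the motif `A.DEF`, the definition of support as the fraction of positive sequences matched, and the choice of accuracy and the Matthews correlation coefficient as example measures. It is not taken from the 14 measures evaluated in the study.

```python
import re
from math import sqrt


def confusion_counts(pattern, positives, negatives):
    """Count matches of a deterministic motif (written as a regex) in both classes."""
    rx = re.compile(pattern)
    tp = sum(1 for seq in positives if rx.search(seq))
    fp = sum(1 for seq in negatives if rx.search(seq))
    fn = len(positives) - tp
    tn = len(negatives) - fp
    return tp, fp, fn, tn


def support(tp, fn):
    """Assumed definition: fraction of positive sequences containing the motif."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical, heavily imbalanced toy data: 4 positives vs. 96 negatives.
positives = ["ACDEFG", "MKLVWY", "PQRSTV", "HHNNCC"]
negatives = ["GGGGGG"] * 96

counts = confusion_counts(r"A.DEF", positives, negatives)
tp, fp, fn, tn = counts

print("support :", support(tp, fn))
print("accuracy:", accuracy(tp, fp, fn, tn))
print("MCC     :", round(mcc(tp, fp, fn, tn), 3))
```

On this toy data the motif matches only one of the four positive sequences, yet accuracy stays close to 1 because the negative class dominates. Support and the correlation-based score both expose the weak coverage, which is one way a single measure can mislead when classes are unbalanced.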
