An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences

MOTIVATION Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). Building an alignment can be computationally intensive and often requires manual adjustment, which is particularly difficult for a set of divergent sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm that is more efficient and offers more functionality than both PRATT2 and TEIRESIAS, and we discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.

RESULTS We designed and implemented six algorithms that mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach with PRATT2 and TEIRESIAS in terms of efficiency, completeness and diversity of pattern types. Compared with PRATT2, our approach is faster, can process large datasets and can identify the so-called type III patterns. It is comparable to TEIRESIAS in discovering type I patterns, but adds functionality such as mining type II and type III patterns and finding patterns that discriminate between two datasets.

AVAILABILITY The source code for the pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
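The general idea of pattern growth on unaligned sequences can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce their type I, II or III pattern definitions, which the abstract does not specify. It assumes the simplest possible pattern type (contiguous substrings), defines support as the number of sequences containing a pattern, and uses hypothetical function names (`mine_frequent_substrings`, `discriminating_patterns`) and toy sequences.

```python
from collections import defaultdict


def mine_frequent_substrings(sequences, min_support):
    """Return {pattern: support} for contiguous substrings found in at
    least `min_support` sequences (assumed definition of support)."""
    frequent = {}

    def grow(pattern, occurrences):
        # occurrences: (sequence index, position just past the pattern)
        extensions = defaultdict(list)
        for seq_idx, end in occurrences:
            seq = sequences[seq_idx]
            if end < len(seq):
                # Extend the pattern by one residue to the right.
                extensions[seq[end]].append((seq_idx, end + 1))
        for residue, occ in sorted(extensions.items()):
            support = len({seq_idx for seq_idx, _ in occ})
            if support >= min_support:
                longer = pattern + residue
                frequent[longer] = support
                # Only frequent patterns are grown further, so infrequent
                # branches are pruned without being enumerated.
                grow(longer, occ)

    # The empty pattern "occurs" before every position of every sequence.
    start = [(i, p) for i, seq in enumerate(sequences) for p in range(len(seq))]
    grow("", start)
    return frequent


def discriminating_patterns(positive, negative, min_support,
                            max_negative_support=0):
    """Patterns frequent in `positive` that occur in at most
    `max_negative_support` sequences of `negative` (assumed criterion)."""
    result = {}
    for pattern, support in mine_frequent_substrings(positive, min_support).items():
        neg_support = sum(pattern in seq for seq in negative)
        if neg_support <= max_negative_support:
            result[pattern] = (support, neg_support)
    return result


if __name__ == "__main__":
    toy_positive = ["MKDRYA", "AADRYL", "GGDRYV"]  # toy data, not real proteins
    toy_negative = ["MKERYA", "AAERWL"]
    print(mine_frequent_substrings(toy_positive, min_support=3))
    print(discriminating_patterns(toy_positive, toy_negative, min_support=3))
```

Because each pattern is extended only at the positions where it already occurs, no alignment of the sequences is needed, and the search space shrinks as patterns grow. Supporting gaps or wildcards would require a richer extension step than the single-residue one assumed here.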
