Generalised Sequence Signatures through symbolic clustering

Traditionally, sequence motifs and domains are defined such that insertions, deletions and mismatched regions are small compared with the matched regions. We introduce an algorithm for identifying Generalised Sequence Signatures (GSS), which can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Sequences are grouped so as to maximise the number of symbols they share. We show that using GSS to derive sequence annotations yields higher confidence values than other signature recognition approaches.
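The idea of symbols and symbol-based grouping can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names `recurring_symbols` and `greedy_groups`, the parameters `min_support` and `min_shared`, and the greedy merge heuristic are all assumptions made for illustration. The sketch treats every length-k window that occurs in at least `min_support` sequences as a symbol, then repeatedly merges the two groups that share the most symbols.

```python
from collections import Counter
from itertools import combinations


def recurring_symbols(seqs, k, min_support=2):
    """Return, per sequence, the set of length-k windows ("symbols")
    that occur in at least min_support different sequences."""
    windows = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    support = Counter(w for ws in windows for w in ws)
    return [{w for w in ws if support[w] >= min_support} for ws in windows]


def greedy_groups(symbol_sets, min_shared=2):
    """Hypothetical heuristic: repeatedly merge the two groups whose
    members have the most symbols in common, stopping when the best
    pair shares fewer than min_shared symbols."""
    # Each group is (member indices, symbols common to all members).
    groups = [([i], set(s)) for i, s in enumerate(symbol_sets)]
    while len(groups) > 1:
        a, b = max(
            combinations(range(len(groups)), 2),
            key=lambda p: len(groups[p[0]][1] & groups[p[1]][1]),
        )
        shared = groups[a][1] & groups[b][1]
        if len(shared) < min_shared:
            break
        merged = (groups[a][0] + groups[b][0], shared)
        groups = [g for j, g in enumerate(groups) if j not in (a, b)] + [merged]
    return groups


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    seqs = ["MKVACDEQQQWYHGG", "ACDELLLLWYHPP", "TTTTTTTT", "GGGSSSTTTTNN"]
    for members, shared in greedy_groups(recurring_symbols(seqs, k=3)):
        print(members, sorted(shared))
```

In the toy input, the first two sequences share symbols in two separate windows with unrelated residues between them, which is the kind of distributed signature that a contiguous motif definition would not capture as a single unit.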
