Scoring Function for Pattern Discovery Programs Taking Into Account Sequence Diversity

An important problem in sequence analysis is to discover patterns matching subsets of a given set of bio-sequences. When a pattern common to a subset is found, the quality of the match should be evaluated. This paper proposes that an evaluation scheme for measuring the quality of a match between a sequence set and a common pattern should take into account both the strength of the pattern and the diversity of the sequences matched. A pattern matching a diverse set of sequences (i.e., one having a low degree of similarity) should get a higher score than an equally strong pattern matching a set of more similar sequences. This prevents sets of very similar sequences from biasing the score of the patterns. Ideally, a measure of the statistical significance of a match that takes sequence diversity into account should be defined. As this is a non-trivial problem, this paper proposes a non-statistical scoring scheme instead. Ideal requirements for such a scoring scheme are given. It is assumed that the strength of a pattern and the diversity of the sequence set can be evaluated independently and then combined into a total score for the match. We use a restricted class of PROSITE-like patterns and a previously reported method for evaluating pattern strength. Two alternative schemes for evaluating the diversity of a set of sequences are proposed: one based on the sum of edge lengths in an estimated phylogenetic tree for the set of sequences, and one based on the minimum weight spanning tree on the graph of all pairwise distances between sequences. For both cases, algorithms and practical application cases are given. The combined measures are shown to have useful properties for a set of test cases.
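
As a rough illustration of the second diversity scheme, the sketch below computes the total weight of a minimum weight spanning tree over a matrix of pairwise distances, using Prim's algorithm. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `mst_weight`, the input format (a symmetric matrix of non-negative distances), and the example distance values are all hypothetical, and the abstract does not say how the distances are obtained or how the result is combined with pattern strength.

```python
from typing import Sequence


def mst_weight(dist: Sequence[Sequence[float]]) -> float:
    """Total edge weight of a minimum spanning tree (Prim's algorithm, O(n^2)).

    `dist` is assumed to be a symmetric n x n matrix of pairwise distances
    between sequences; how those distances are computed is left open here.
    """
    n = len(dist)
    if n < 2:
        return 0.0  # a single sequence (or none) has no diversity to measure

    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    best[0] = 0.0              # start growing the tree from sequence 0
    total = 0.0

    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Update attachment costs for the nodes still outside the tree.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total


# Hypothetical distances: three near-identical sequences vs. three divergent ones.
similar = [[0.0, 0.1, 0.1], [0.1, 0.0, 0.1], [0.1, 0.1, 0.0]]
diverse = [[0.0, 0.9, 0.8], [0.9, 0.0, 0.7], [0.8, 0.7, 0.0]]
print(mst_weight(similar), mst_weight(diverse))
```

In this toy example the divergent set gets the larger spanning tree weight, which matches the stated requirement that a pattern matching a more diverse set should be rewarded. Note also that adding an exact duplicate of an existing sequence adds only a zero-weight edge, so redundant sequences do not inflate the measure.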
