Statistical Properties of Similarity Score Functions

In computational biology, a large number of problems, such as pattern discovery, deal with the comparison of several sequences (of nucleotides, proteins or genes, for instance). Very often, algorithms that address these problems use score functions that reflect a notion of similarity between the sequences. The most efficient methods take advantage of theoretical knowledge of the typical behavior of these score functions, such as their mean, their variance, and sometimes their asymptotic distribution in a given probabilistic model. In this paper, we study a recent family of score functions introduced in Mancheron 2003, which allow the comparison of two words of the same length. Here, the similarity takes into account all matches and mismatches between two sequences and not only the longest common subsequence as in the case of classical algorithms such as BLAST or FASTA. Based on generating functions, we provide closed-form formulas for the mean and the variance of these functions in an independent probabilistic model. Finally, we prove that every function in this family asymptotically behaves as a Gaussian random variable.
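To make the notions of mean and variance in an independent model concrete, the following is a minimal sketch for a deliberately simple additive score, which is an assumption made for illustration and not the family of score functions studied in the paper. Each position contributes a hypothetical `match` or `mismatch` weight, letters are drawn independently with probabilities `probs`, and all function names are illustrative. For this toy score the closed forms follow directly from the independence of positions, and a brute-force enumeration over short words checks them.

```python
from itertools import product


def additive_score(u, v, match=1.0, mismatch=-1.0):
    """Toy score (assumption): sum a weight over every aligned position."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(u, v))


def match_probability(probs):
    """Probability that two independent letters drawn from probs coincide."""
    return sum(p * p for p in probs.values())


def score_mean_variance(n, probs, match=1.0, mismatch=-1.0):
    """Closed-form mean and variance of the toy score for words of length n."""
    q = match_probability(probs)
    mean = n * (match * q + mismatch * (1.0 - q))
    var = n * q * (1.0 - q) * (match - mismatch) ** 2
    return mean, var


def brute_force_mean_variance(n, probs, match=1.0, mismatch=-1.0):
    """Exact mean and variance by enumerating all pairs of words (small n only)."""
    letters = list(probs)
    first = 0.0
    second = 0.0
    for u in product(letters, repeat=n):
        for v in product(letters, repeat=n):
            weight = 1.0
            for a in u + v:
                weight *= probs[a]
            s = additive_score(u, v, match, mismatch)
            first += weight * s
            second += weight * s * s
    return first, second - first ** 2
```

Because this toy score is a sum of independent, identically distributed terms, the classical central limit theorem already gives Gaussian behavior as the length grows; the paper's result concerns its own, more general family.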

[1]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[2]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Philippe Flajolet,et al.  Singularity Analysis of Generating Functions , 1990, SIAM J. Discret. Math..

[4]  Brigitte Vallée,et al.  Dynamical Sources in Information Theory : Fundamental intervals and Word Prefixes , 1998 .

[5]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[6]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[7]  S. B. Needleman,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 1989 .

[8]  William R. Taylor,et al.  Structure Comparison and Structure Patterns , 2000, J. Comput. Biol..

[9]  Irena Rusu,et al.  Pattern Discovery Allowing Wild-Cards, Substitution Matrices, and Multiple Score Functions , 2003, WABI.

[10]  Aris Floratos,et al.  Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in Bioinformatics 1998;14(2): 229] , 1998, Bioinform..

[11]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[12]  Philippe Flajolet,et al.  ANALYTIC COMBINATORICS — SYMBOLIC COMBINATORICS , 2002 .

[13]  Jérémie Bourdon,et al.  Generalized Pattern Matching Statistics , 2002 .

[14]  Mireille Régnier,et al.  Assessing the Statistical Significance of Overrepresented Oligonucleotides , 2001, WABI.

[15]  D. Lipman,et al.  Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[16]  V. Batagelj,et al.  Comparing resemblance measures , 1995 .

[17]  Inge Jonassen,et al.  Efficient discovery of conserved patterns using a pattern graph , 1997, Comput. Appl. Biosci..

[18]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[19]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.