A mathematical consideration of the word-composition vector method in comparison of biological sequences

To measure the similarity or dissimilarity between two given biological sequences, several papers have proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, the frequencies of all K-tuple words (words of length K) are counted along each of the two sequences, which transforms each sequence into its word-composition vector. Then a distance metric, for example the angle between the two vectors, is calculated. A significant issue is how to determine the optimal word size K. Using a mathematical model of the mutational events that occur in sequences (substitutions, insertions, deletions and duplications), we analyzed how the angle between the composition vectors depends on these events. We also considered the optimal word size (i.e., the resolution) using our own approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
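
The procedure described above can be illustrated with a minimal sketch. It assumes overlapping K-tuple words and raw counts (the angle is unchanged if the counts are rescaled to relative frequencies); the function names `composition_vector` and `angle` are hypothetical and are not taken from the paper, and no background correction or mutation model is included.

```python
from collections import Counter
from math import acos, sqrt


def composition_vector(seq, k):
    """Count every overlapping K-tuple word in seq (assumed convention)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def angle(u, v):
    """Angle in radians between two sparse word-count vectors."""
    # Words absent from v contribute 0, since Counter returns 0 for missing keys.
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("sequence shorter than the word size K")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return acos(cosine)


if __name__ == "__main__":
    a = composition_vector("ACGTACGTAC", 2)
    b = composition_vector("ACGTTCGTAC", 2)  # one substitution relative to a
    print(angle(a, b))
```

Repeating the calculation for several values of K on the same pair of sequences gives a rough feel for how the angle responds to the word size, which is the dependence the paper analyzes.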
