A metric measure for weight matrices of variable lengths—with applications to clustering and classification of hidden Markov models

We construct a metric on weight matrices, which are commonly used in non-interacting statistical physics systems, in computational biology, and in general applications such as hidden Markov models. The distance between two weight matrices is obtained by aligning them and can therefore be evaluated by dynamic programming. The metric accommodates both gapless and gapped alignments between two weight matrices and allows reverse complements in the distance evaluation. We also study the distance statistics among random motifs. We find that the average squared distance and its standard error grow with different powers of the motif length, and that the normalized squared distance follows a Gaussian distribution for large motif lengths.
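To make the alignment idea concrete, the following is a minimal Python sketch of an alignment-based distance between two weight matrices of different lengths. It is an illustration under stated assumptions, not the construction used in the paper. The abstract does not specify the column-level distance or the gap treatment, so the sketch assumes that each column is a probability distribution over (A, C, G, T), that the squared distance between two columns is their Jensen-Shannon divergence, and that a gapped column is compared against a uniform background. The names `js_divergence`, `reverse_complement`, `aligned_square_distance`, and `matrix_distance` are hypothetical, and the sketch does not claim that the resulting quantity satisfies the triangle inequality.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reverse_complement(matrix):
    """Reverse the column order and swap A<->T, C<->G.

    With columns ordered (A, C, G, T), complementing a column is the same
    as reversing it.
    """
    return [list(reversed(col)) for col in reversed(matrix)]


def aligned_square_distance(x, y, background):
    """Minimum summed squared column distance over global gapped alignments.

    Assumption: a column aligned to a gap is scored against `background`.
    """
    n, m = len(x), len(y)
    gap_x = [js_divergence(col, background) for col in x]
    gap_y = [js_divergence(col, background) for col in y]

    # d[i][j] is the best score for aligning x[:i] with y[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap_x[i - 1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap_y[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + js_divergence(x[i - 1], y[j - 1]),  # match
                d[i - 1][j] + gap_x[i - 1],  # column of x against a gap
                d[i][j - 1] + gap_y[j - 1],  # column of y against a gap
            )
    return d[n][m]


def matrix_distance(x, y, background=(0.25, 0.25, 0.25, 0.25)):
    """Distance allowing either orientation of y (assumed uniform background)."""
    forward = aligned_square_distance(x, y, background)
    reverse = aligned_square_distance(x, reverse_complement(y), background)
    return math.sqrt(max(min(forward, reverse), 0.0))


if __name__ == "__main__":
    motif_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
    motif_b = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
    print(matrix_distance(motif_a, motif_b))
```

The dynamic program fills an (n+1) by (m+1) table, so the cost is proportional to the product of the two motif lengths. Restricting the recursion to the diagonal move for each fixed offset would give a gapless variant.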
