On Order Equivalences between Distance and Similarity Measures on Sequences and Trees

Both 'distance' and 'similarity' measures have been proposed for the comparison of sequences and for the comparison of trees, based on scoring mappings, and this paper concerns whether or not these are equivalent. These measures are usually parameterised by an atomic 'cost' table, defining label-dependent values for swaps, deletions and insertions. We look at the question of whether orderings induced by a 'distance' measure, with some cost-table, can be dualized by a 'similarity' measure, with some other cost-table, and vice-versa. Three kinds of orderings are considered: alignment-orderings, for fixed source $S$ and target $T$; neighbour-orderings, where for a fixed $S$, varying candidate neighbours $T_i$ are ranked; and pair-orderings, where for varying $S_i$ and varying $T_j$, the pairings $\langle S_i, T_j \rangle$ are ranked. We show that (1) alignment-orderings by distance can be dualized by similarity, and vice-versa; (2) neighbour-ordering and pair-ordering by distance can be dualized by similarity; (3) neighbour-ordering and pair-ordering by similarity can sometimes not be dualized by distance. A consequence of this is that there are categorisation and hierarchical clustering outcomes which can be achieved via similarity but not via distance.

1 TREE DISTANCE AND SIMILARITY

In many pattern-recognition scenarios the data either takes the form of, or can be encoded as, sequences or trees. Accordingly, there has been much work on the definition, implementation and deployment of measures for the comparison of sequences and for the comparison of trees. These measures are sometimes described as 'distances' and sometimes as 'similarities'. In what follows we first distinguish between these, and then address the question of whether orderings induced by a 'distance' measure can be dualized by a 'similarity' measure, and vice-versa.
To some extent this can be seen as applying the same kind of analysis to sequence and tree comparison measures as has been applied to set and vector comparison measures (Batagelj and Bren, 1995; Omhover et al., 2005; Lesot and Rifqi, 2010). From statements such as the following:

    "To compare RNA structures, we need a score system, or alternatively a distance, which measures the similarity (or the difference) between the structures. These two versions of the problem score and distance are equivalent." (Herrbach et al., 2006)

which are not uncommon in the literature (Alves et al., 2002; Kondrak, 2003; Bose and van der Aalst, 2009), it would be easy to gain the impression that similarity and distance (on sequences and trees) are straightforwardly interchangeable notions. In section 1.1 several distinct kinds of equivalence are defined. Sections 2, 3.1 and 3.2 then show that while some kinds of equivalence hold, others do not.

To begin, we need to clarify what we will mean by 'distance' and 'similarity' on sequences and trees. Because sequences can be encoded as vertical trees, it suffices to give definitions for trees.

Tai first proposed a tree-distance measure (Tai, 1979). Where $S$ and $T$ are ordered, labelled trees, a Tai mapping $\alpha : S \mapsto T$ is a partial, 1-to-1 function from the nodes of $S$ into the nodes of $T$ which respects left-to-right order and ancestry. That is, if $(i, j)$ and $(i', j')$ are in the mapping then (T1) $\mathrm{left}(i, i')$ iff $\mathrm{left}(j, j')$ and (T2) $\mathrm{anc}(i, i')$ iff $\mathrm{anc}(j, j')$. For the purpose of assigning a score to such a mapping it is convenient to identify three sets:

    $\mathcal{M}$: the $(i, j) \in \alpha$ (the 'matches' and 'swaps')
    $\mathcal{D}$: the $i \in S$ s.t. $\forall j \in T, (i, j) \notin \alpha$ (the 'deletions')
    $\mathcal{I}$: the $j \in T$ s.t. $\forall i \in S, (i, j) \notin \alpha$ (the 'insertions')

Thus $\mathcal{M}$ is just the mapping, as a set of node pairs, and $\mathcal{D}$ and $\mathcal{I}$ are just the remaining nodes of $S$ and $T$ which are not 'touched' by the mapping.
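The conditions on a Tai mapping, and the split into matched, deleted and inserted nodes, can be illustrated with a minimal Python sketch. The encoding of a tree as nested `(label, children)` tuples, the use of preorder positions as node identifiers, and the function names `index_tree`, `anc`, `left`, `is_tai_mapping` and `split_mapping` are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def index_tree(tree):
    """Map each node id (its preorder position) to (label, preorder, postorder).

    `tree` is a nested (label, [children]) tuple; this encoding is an
    assumption of the sketch.
    """
    info = {}
    counter = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        node_id = counter["pre"]
        counter["pre"] += 1
        for child in children:
            visit(child)
        info[node_id] = (label, node_id, counter["post"])
        counter["post"] += 1

    visit(tree)
    return info


def anc(info, a, b):
    """True if node a is a proper ancestor of node b."""
    return info[a][1] < info[b][1] and info[a][2] > info[b][2]


def left(info, a, b):
    """True if node a lies strictly to the left of node b."""
    return info[a][1] < info[b][1] and info[a][2] < info[b][2]


def is_tai_mapping(s_info, t_info, pairs):
    """Check that `pairs` is 1-to-1 and satisfies conditions (T1) and (T2)."""
    sources = [i for i, _ in pairs]
    targets = [j for _, j in pairs]
    if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
        return False
    for (i, j), (i2, j2) in combinations(pairs, 2):
        # Test both orientations of each pair of mapped node pairs.
        for a, b, c, d in ((i, i2, j, j2), (i2, i, j2, j)):
            if left(s_info, a, b) != left(t_info, c, d):
                return False
            if anc(s_info, a, b) != anc(t_info, c, d):
                return False
    return True


def split_mapping(s_info, t_info, pairs):
    """Return the sets (M, D, I): mapped pairs, deleted nodes, inserted nodes."""
    matched = set(pairs)
    deleted = set(s_info) - {i for i, _ in pairs}
    inserted = set(t_info) - {j for _, j in pairs}
    return matched, deleted, inserted


# Example: S = a(b, c), T = a(c); map the roots together and c to c.
S = ("a", [("b", []), ("c", [])])
T = ("a", [("c", [])])
s_info, t_info = index_tree(S), index_tree(T)
mapping = [(0, 0), (2, 1)]
print(is_tai_mapping(s_info, t_info, mapping))
print(split_mapping(s_info, t_info, mapping))
```

In this example the node labelled `b` in the source tree is untouched by the mapping, so it ends up in the deletion set, while the insertion set is empty. Comparing preorder and postorder positions is one simple way to decide ancestry and left-to-right order; other encodings would work equally well.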
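Let $(.)^\gamma$ give the label of a node and let $C^\Delta$ be a 'cost' table, indexed by $\{\lambda\} \cup \Sigma$, where $\Sigma$ is the alphabet of labels, which assigns 'costs' to $\mathcal{M}$, $\mathcal{D}$ and $\mathcal{I}$ according to:

    for $(i, j) \in \mathcal{M}$ cost is $C^\Delta(i^\gamma, j^\gamma)$
    for $i \in \mathcal{D}$ cost is $C^\Delta(i^\gamma, \lambda)$
    for $j \in \mathcal{I}$ cost is $C^\Delta(\lambda, j^\gamma)$

Where $\alpha : S \mapsto T$ is any mapping from $S$ to $T$, define $\Delta(\alpha : S \mapsto T)$ by Definition 1 ('distance' scoring of an alignment).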
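As a minimal sketch of how a cost table might be applied to a mapping, the following Python code looks up the three kinds of cost listed above and adds them up into a single score. Summing the costs is an assumption of this sketch (it is the usual convention for Tai-style edit distances), as are the dictionary representation of the table, the `LAMBDA` placeholder for the empty label, the unit-cost example table, and the function names.

```python
LAMBDA = ""  # hypothetical stand-in for the empty label (lambda)


def unit_cost_table(alphabet):
    """Build an example table: 0 for a match, 1 for any swap, deletion or insertion.

    This particular table is an illustrative assumption, not one from the paper.
    """
    cost = {}
    for x in alphabet:
        cost[(x, LAMBDA)] = 1  # deleting a node labelled x
        cost[(LAMBDA, x)] = 1  # inserting a node labelled x
        for y in alphabet:
            cost[(x, y)] = 0 if x == y else 1  # match or swap
    return cost


def distance_score(s_labels, t_labels, pairs, cost):
    """Score a mapping by summing the costs of its matches/swaps, deletions and insertions.

    s_labels and t_labels map node ids to labels; pairs is the mapping as a
    list of (source node, target node) pairs.
    """
    mapped_s = {i for i, _ in pairs}
    mapped_t = {j for _, j in pairs}
    match_cost = sum(cost[(s_labels[i], t_labels[j])] for i, j in pairs)
    deletion_cost = sum(
        cost[(s_labels[i], LAMBDA)] for i in s_labels if i not in mapped_s
    )
    insertion_cost = sum(
        cost[(LAMBDA, t_labels[j])] for j in t_labels if j not in mapped_t
    )
    return match_cost + deletion_cost + insertion_cost


# Example: source nodes labelled a, b, c and target nodes labelled a, c.
s_labels = {0: "a", 1: "b", 2: "c"}
t_labels = {0: "a", 1: "c"}
cost = unit_cost_table("abc")
print(distance_score(s_labels, t_labels, [(0, 0), (2, 1)], cost))  # -> 1
print(distance_score(s_labels, t_labels, [(0, 0), (1, 1)], cost))  # -> 2
print(distance_score(s_labels, t_labels, [], cost))                # -> 5
```

Under this unit-cost table, the mapping that pairs `c` with `c` and deletes `b` scores lower than the one that swaps `b` for `c` and deletes `c`, and the empty mapping, which deletes every source node and inserts every target node, scores highest.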

[1] Kaizhong Zhang, et al. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems, 1989, SIAM J. Comput.

[2] Edson Cáceres, et al. Parallel dynamic programming for solving the string editing problem on a CGM/BSP, 2002, SPAA '02.

[3] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[4] Peter A. Spiro, et al. A Local Alignment Metric for Accelerating Biosequence Database Search, 2004, J. Comput. Biol.

[5] Peter N. Yianilos, et al. Learning String-Edit Distance, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[6] Alain Denise, et al. Alignment of RNA secondary structures using a full set of operations, 2006.

[7] Richard Johansson, et al. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages, 2009, CoNLL Shared Task.

[8] Maria Rifqi, et al. Ranking Invariance Based on Similarity Measures in Document Retrieval, 2005, Adaptive Multimedia Retrieval.

[9] Wil M. P. van der Aalst, et al. Context Aware Trace Clustering: Towards Improving Process Mining Results, 2009, SDM.

[10] Grzegorz Kondrak, et al. Phonetic Alignment and Similarity, 2003, Comput. Humanit.

[11] Marie-Jeanne Lesot, et al. Order-Based Equivalence Degrees for Similarity and Distance Measures, 2010, IPMU.

[12] T. Kuboyama. Matching and Learning in Trees, 2007.

[13] V. Batagelj, et al. Comparing resemblance measures, 1995.

[14] Aleksandar Stojmirovic, et al. Geometric Aspects of Biological Sequence Comparison, 2009, J. Comput. Biol.

[15] Marc Sebban, et al. Learning probabilistic models of tree edit distance, 2008, Pattern Recognit.

[16] Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, JACM.

[17] Temple F. Smith, et al. Comparison of biosequences, 1981.

[18] 김동규, et al. [Book review] Algorithms on Strings, Trees, and Sequences, 2000.

[19] Bin Ma, et al. On the similarity metric and the distance metric, 2009, Theor. Comput. Sci.

[20] Martin Emms. Trainable Tree Distance and an Application to Question Categorisation, 2010, KONVENS.

[21] M. Kendall. The treatment of ties in ranking problems, 1945, Biometrika.

[22] Kuo-Chung Tai, et al. The Tree-to-Tree Correction Problem, 1979, JACM.