Paper information - On Order Equivalences between Distance and Similarity Measures on Sequences and Trees

On Order Equivalences between Distance and Similarity Measures on Sequences and Trees

Both 'distance' and 'similarity' measures have been proposed for the comparison of sequences and for the comparison of trees, based on scoring mappings, and this paper concerns the equivalence or otherwise of these. These measures are usually parameterised by an atomic 'cost' table, defining label-dependent values for swaps, deletions and insertions. We look at the question of whether orderings induced by a 'distance' measure, with some cost-table, can be dualized by a 'similarity' measure, with some other cost-table, and vice-versa. Three kinds of orderings are considered: alignment-orderings, for fixed source S and target T; neighbour-orderings, where for a fixed S, varying candidate neighbours T_i are ranked; and pair-orderings, where for varying S_i and varying T_j, the pairings 〈S_i, T_j〉 are ranked. We show that (1) alignment-orderings by distance can be dualized by similarity, and vice-versa; (2) neighbour-ordering and pair-ordering by distance can be dualized by similarity; (3) neighbour-ordering and pair-ordering by similarity can sometimes not be dualized by distance. A consequence of this is that there are categorisation and hierarchical clustering outcomes which can be achieved via similarity but not via distance.

1 TREE DISTANCE AND SIMILARITY

In many pattern-recognition scenarios the data either takes the form of, or can be encoded as, sequences or trees. Accordingly, there has been much work on the definition, implementation and deployment of measures for the comparison of sequences and for the comparison of trees. These measures are sometimes described as 'distances' and sometimes as 'similarities'. In what follows we are concerned first with distinguishing between these, and then with the question of whether orderings induced by a 'distance' measure can be dualized by a 'similarity' measure, and vice-versa.
To some extent this can be seen as applying the same kind of analysis to sequence and tree comparison measures as has been applied to set and vector comparison measures (Batagelj and Bren, 1995; Omhover et al., 2005; Lesot and Rifqi, 2010). From statements such as the following

    "To compare RNA structures, we need a score system, or alternatively a distance, which measures the similarity (or the difference) between the structures. These two versions of the problem score and distance are equivalent." (Herrbach et al., 2006)

which are not uncommon in the literature (Alves et al., 2002; Kondrak, 2003; Bose and van der Aalst, 2009), it would be easy to gain the impression that similarity and distance (on sequences and trees) are straightforwardly interchangeable notions. In section 1.1 several distinct kinds of equivalence are defined. Sections 2, 3.1 and 3.2 then show that while some kinds of equivalence hold, others do not.

To begin we need to clarify what we will mean by 'distance' and 'similarity' on sequences and trees. Because sequences can be encoded as vertical trees it suffices to give definitions for trees. Tai first proposed a tree-distance measure (Tai, 1979). Where S and T are ordered, labelled trees, a Tai mapping α : S ↦ T is a partial, 1-to-1 function from the nodes of S into the nodes of T, which respects left-to-right order and ancestry: if (i, j) and (i′, j′) are in the mapping then (T1) left(i, i′) iff left(j, j′) and (T2) anc(i, i′) iff anc(j, j′). For the purpose of assigning a score to such a mapping it is convenient to identify three sets:

    M: the (i, j) ∈ α: the 'matches' and 'swaps'
    D: the i ∈ S s.t. ∀j ∈ T, (i, j) ∉ α: the 'deletions'
    I: the j ∈ T s.t. ∀i ∈ S, (i, j) ∉ α: the 'insertions'

Thus M is just the mapping, as a set of node pairs, and D and I are just the remaining nodes of S and T which are not 'touched' by the mapping.
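The Tai mapping conditions and the split into matches/swaps, deletions and insertions can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes trees are given as nested `(label, [children])` tuples and that nodes are identified by their preorder position; the helper names `index_tree`, `left`, `anc`, `is_tai_mapping` and `partition` are hypothetical and introduced only for this example.

```python
from itertools import combinations

def index_tree(tree):
    """Index a tree given as (label, [children]).
    Returns (labels, post): labels[k] is the label of the node with
    preorder id k, and post[k] is that node's postorder position."""
    labels, post = [], {}
    def visit(node):
        label, children = node
        node_id = len(labels)      # preorder id
        labels.append(label)
        for child in children:
            visit(child)
        post[node_id] = len(post)  # postorder position
    visit(tree)
    return labels, post

def anc(post, a, b):
    # a is a proper ancestor of b: earlier in preorder, later in postorder
    return a < b and post[a] > post[b]

def left(post, a, b):
    # a lies to the left of b: earlier in both preorder and postorder
    return a < b and post[a] < post[b]

def is_tai_mapping(pairs, post_s, post_t):
    """Check that pairs is 1-to-1 and satisfies conditions (T1) and (T2)."""
    pairs = list(pairs)
    sources = [i for i, _ in pairs]
    targets = [j for _, j in pairs]
    if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
        return False
    for (i, j), (i2, j2) in combinations(pairs, 2):
        # test each unordered pair in both directions
        for a, b, c, d in ((i, i2, j, j2), (i2, i, j2, j)):
            if left(post_s, a, b) != left(post_t, c, d):   # (T1)
                return False
            if anc(post_s, a, b) != anc(post_t, c, d):     # (T2)
                return False
    return True

def partition(pairs, n_source, n_target):
    """Split a mapping into M (pairs), D (untouched source nodes)
    and I (untouched target nodes)."""
    M = set(pairs)
    D = set(range(n_source)) - {i for i, _ in M}
    I = set(range(n_target)) - {j for _, j in M}
    return M, D, I

# Example: S = a(b, c), T = a(c)
labels_s, post_s = index_tree(("a", [("b", []), ("c", [])]))
labels_t, post_t = index_tree(("a", [("c", [])]))
print(is_tai_mapping([(0, 0), (2, 1)], post_s, post_t))  # True
print(is_tai_mapping([(1, 0), (0, 1)], post_s, post_t))  # False: ancestry not preserved
print(partition([(0, 0), (2, 1)], len(labels_s), len(labels_t)))
```

In the first mapping the root is matched to the root and `c` to `c`, leaving node `b` of S as the single deletion and no insertions.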
Let (.)^γ give the label of a node and let C^∆ be a 'cost' table, indexed by {λ} ∪ Σ, where Σ is the alphabet of labels, which assigns 'costs' to M, D and I according to:

    for (i, j) ∈ M, the cost is C^∆(i^γ, j^γ)
    for i ∈ D, the cost is C^∆(i^γ, λ)
    for j ∈ I, the cost is C^∆(λ, j^γ)

Where α : S ↦ T is any mapping from S to T, define ∆(α : S ↦ T) by Definition 1 ('distance' scoring of an alignment).
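As a rough illustration of how a cost table assigns a value to a mapping, the sketch below scores the three sets M, D and I. It assumes that the score of a mapping is the sum of the per-node costs, as in Tai's original tree distance; the text of Definition 1 is not reproduced here, so this should be read as an assumption rather than as the paper's definition. The names `LAMBDA`, `unit_cost_table` and `distance_score` are hypothetical, and the unit-cost table (0 for identical labels, 1 otherwise) is just one example setting.

```python
LAMBDA = "λ"  # stands for the empty label used for deletions and insertions

def unit_cost_table(alphabet):
    """Example cost table over {λ} ∪ Σ: 0 for a match, 1 for a swap,
    a deletion or an insertion. This choice of costs is an assumption."""
    symbols = [LAMBDA] + list(alphabet)
    return {
        (x, y): 0 if x == y else 1
        for x in symbols
        for y in symbols
        if (x, y) != (LAMBDA, LAMBDA)
    }

def distance_score(M, D, I, labels_s, labels_t, cost):
    """Sum the table costs over matches/swaps (M), deletions (D)
    and insertions (I). Summation is assumed, following Tai (1979)."""
    total = sum(cost[(labels_s[i], labels_t[j])] for i, j in M)
    total += sum(cost[(labels_s[i], LAMBDA)] for i in D)
    total += sum(cost[(LAMBDA, labels_t[j])] for j in I)
    return total

# Example: S has node labels a, b, c and T has node labels a, c
labels_s = ["a", "b", "c"]
labels_t = ["a", "c"]
cost = unit_cost_table("abc")

# Mapping that matches a-a and c-c and deletes b
print(distance_score({(0, 0), (2, 1)}, {1}, set(), labels_s, labels_t, cost))  # 1

# Mapping that matches only a-a, deleting b and c and inserting c
print(distance_score({(0, 0)}, {1, 2}, {1}, labels_s, labels_t, cost))  # 3
```

Under these example costs the first mapping scores lower than the second, which is the kind of alignment-ordering for a fixed S and T that the paper goes on to study.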

Martin Emms | Hector-Hugo Franco-Penya