A Parametrized Family of Tversky Metrics Connecting the Jaccard Distance to an Analogue of the Normalized Information Distance

Jiménez, Becerra, and Gelbukh (2013) defined a family of “symmetric Tversky ratio models” S_{α,β}, 0 ≤ α ≤ 1, β > 0. Each function D_{α,β} = 1 − S_{α,β} is a semimetric on the powerset of a given finite set. We show that D_{α,β} is a metric if and only if 0 ≤ α ≤ 1/2 and β ≥ 1/(1 − α). This result is formally verified in the Lean proof assistant. The extreme points of this parametrized space of metrics are J_1 = D_{1/2,2}, the Jaccard distance, and J_∞ = D_{0,1}, an analogue of the normalized information distance of M. Li, Chen, X. Li, Ma, and Vitányi (2004).
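The metric condition can be explored numerically on small sets. The sketch below is illustrative only and is not the Lean formalization. It assumes the definition S_{α,β}(X, Y) = |X ∩ Y| / (|X ∩ Y| + β(α·a + (1 − α)·b)) with a = min(|X \ Y|, |Y \ X|) and b = max(|X \ Y|, |Y \ X|), which is not stated in the abstract; it is adopted here because it reduces to the Jaccard index at (α, β) = (1/2, 2) and to |X ∩ Y| / max(|X|, |Y|) at (0, 1). The function names are hypothetical, and the code assumes 0 ≤ α < 1 so that the denominator is nonzero for distinct sets.

```python
from fractions import Fraction
from itertools import combinations, product


def tversky_distance(x, y, alpha, beta):
    """D = 1 - S under the assumed definition of S (see the text above).

    Assumes 0 <= alpha < 1 and beta > 0; exact arithmetic via Fraction.
    """
    if x == y:
        # Identical sets (including two empty sets) are at distance 0.
        return Fraction(0)
    common = len(x & y)
    a = min(len(x - y), len(y - x))
    b = max(len(x - y), len(y - x))
    # For x != y we have b >= 1, so the denominator is positive when alpha < 1.
    denom = common + beta * (alpha * a + (1 - alpha) * b)
    return 1 - Fraction(common) / denom


def powerset(universe):
    """All subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def satisfies_triangle_inequality(alpha, beta, universe):
    """Brute-force check of D(x, z) <= D(x, y) + D(y, z) over all subsets."""
    subsets = powerset(universe)
    return all(
        tversky_distance(x, z, alpha, beta)
        <= tversky_distance(x, y, alpha, beta)
        + tversky_distance(y, z, alpha, beta)
        for x, y, z in product(subsets, repeat=3)
    )


half = Fraction(1, 2)
jaccard_ok = satisfies_triangle_inequality(half, 2, {0, 1, 2})  # J_1
j_inf_ok = satisfies_triangle_inequality(0, 1, {0, 1, 2})       # J_infinity
dice_ok = satisfies_triangle_inequality(half, 1, {0, 1})        # beta < 1/(1 - alpha)
```

A brute-force check over one small universe can only refute the triangle inequality, not prove it in general. Here the pair (1/2, 1), which under the assumed definition is the Sørensen-Dice distance, already fails on a two-element universe with X = {0}, Y = {1}, Z = {0, 1}: D(X, Y) = 1 while D(X, Z) + D(Z, Y) = 2/3.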

[1] Alexander F. Gelbukh, et al. SOFTCARDINALITY-CORE: Improving Text Overlap with Distributional Measures for Semantic Textual Similarity, 2013, *SEMEVAL.

[2] Paul M. B. Vitányi, et al. Clustering by compression, 2003, IEEE Transactions on Information Theory.

[3] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[4] A. Tversky. Features of Similarity, 1977.

[5] Bin Ma, et al. The similarity metric, 2001, IEEE Transactions on Information Theory.

[6] Xin Chen, et al. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, 2001, Bioinform.

[7] Elena Deza, et al. Encyclopedia of Distances, 2014.

[8] Abraham Lempel, et al. On the Complexity of Finite Sequences, 1976, IEEE Trans. Inf. Theory.

[9] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[10] Alexander Kraskov, et al. Published under the scientific responsability of the EUROPEAN PHYSICAL SOCIETY Incorporating, 2002.

[11] Alexander Kraskov, et al. Hierarchical Clustering Based on Mutual Information, 2003, ArXiv.

[12] Vorapong Suppakitpaisarn, et al. Relaxed triangle inequality ratio of the Sørensen-Dice and Tversky indexes, 2018, Theor. Comput. Sci.

[13] Edward Raff, et al. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance, 2017, KDD.

[14] Paul M. B. Vitányi, et al. The Google Similarity Distance, 2004, IEEE Transactions on Knowledge and Data Engineering.

[15] Vorapong Suppakitpaisarn, et al. Semimetric Properties of Sørensen-Dice and Tversky Indexes, 2016, WALCOM.