An Information Measure for Comparing Top k Lists

Comparing the top k elements of two or more ranked results is a common task in many settings. Several measures with attractive mathematical properties have been proposed for comparing top k lists, but they suffer from a number of pitfalls and shortcomings in practice. This work introduces a new measure for comparing any two top k lists, based on the information the lists convey. Our method examines the compressibility of the lists: the length of the message needed to encode the lists losslessly gives a natural and robust measure of their variability. This information-theoretic measure objectively reconciles the main considerations that arise when measuring (dis)similarity between lists: the extent of their non-overlapping elements, the amount of disarray among their overlapping elements, and the displacement of the actual ranks (positions) of their overlapping elements. We demonstrate that our measure is intuitively simple and superior to other commonly used measures. To the best of our knowledge, this is the first attempt to address the problem using information compression as its basis.
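To make the three considerations concrete, the following Python sketch computes each of them separately for a pair of top k lists. It is an illustration only and is not the information-theoretic measure introduced in this work: the function name `compare_components` is hypothetical, disarray is counted as discordant pairs in the style of Kendall's tau, displacement is summed as absolute rank differences in the style of Spearman's footrule, and each list is assumed to contain no repeated items.

```python
from itertools import combinations


def compare_components(a, b):
    """Return (non_overlap, disarray, displacement) for two top-k lists.

    Hypothetical helper for illustration; assumes no duplicates within a list.
    """
    # 1-based rank of every item in each list
    pos_a = {item: rank for rank, item in enumerate(a, start=1)}
    pos_b = {item: rank for rank, item in enumerate(b, start=1)}

    # Shared items, kept in the order they appear in list a
    common = [item for item in a if item in pos_b]

    # Items that appear in exactly one of the two lists
    non_overlap = len(set(a) ^ set(b))

    # Pairs of shared items whose relative order differs between the lists
    disarray = sum(
        1 for x, y in combinations(common, 2) if pos_b[x] > pos_b[y]
    )

    # Total absolute shift in rank of the shared items
    displacement = sum(abs(pos_a[x] - pos_b[x]) for x in common)

    return non_overlap, disarray, displacement


if __name__ == "__main__":
    first = ["a", "b", "c", "d"]
    second = ["b", "a", "e", "c"]
    print(compare_components(first, second))  # (2, 1, 3)
```

In this example the lists disagree on two items (`d` and `e`), one shared pair (`a`, `b`) is swapped, and the shared items move by three rank positions in total. The three numbers live on different scales, which is the difficulty a single message-length measure is intended to resolve.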
