Clustering Based on Kolmogorov Information

Abstract. In this paper we show how to reduce the computational cost of Clustering by Compression, proposed by Cilibrasi & Vitányi, from O(n^4) to O(n^2). To that end, we adapt the Weighted Pair Group Method using Averages (WPGMA) to the same compression-based similarity measure used in Clustering by Compression. As a result, our approach easily classifies thousands of objects, whereas the algorithm proposed by Cilibrasi & Vitányi reaches its limits at around a hundred objects. We also report experimental results.
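
To make the idea concrete, here is a minimal sketch of WPGMA driven by a compression-based distance. It is an illustration under stated assumptions, not the implementation evaluated in the paper: zlib stands in for the compressor, the distance is the normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) from the Clustering by Compression literature, and the merge loop is the naive one rather than an optimized quadratic-time variant. The function names (`compressed_size`, `ncd`, `wpgma`) are hypothetical.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Approximate C(data) by the length of its zlib output (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def wpgma(objects):
    """Naive WPGMA: returns the dendrogram as nested tuples of object indices."""
    # Each active cluster id maps to its subtree; leaves are plain indices.
    clusters = {i: i for i in range(len(objects))}
    # Pairwise distances between active clusters, keyed by unordered id pairs.
    dist = {
        frozenset((i, j)): ncd(objects[i], objects[j])
        for i, j in combinations(range(len(objects)), 2)
    }
    next_id = len(objects)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters and merge them.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        del dist[pair]
        merged = (clusters.pop(a), clusters.pop(b))
        # WPGMA update: the new distance is the plain average of the two
        # old distances, regardless of cluster sizes.
        for k in clusters:
            d_ak = dist.pop(frozenset((a, k)))
            d_bk = dist.pop(frozenset((b, k)))
            dist[frozenset((next_id, k))] = (d_ak + d_bk) / 2
        clusters[next_id] = merged
        next_id += 1
    return next(iter(clusters.values()))
```

Calling `wpgma([b"...", b"...", ...])` on a list of byte strings returns a binary tree whose leaves are the list indices. The pairwise distance table alone requires a quadratic number of compressor calls, which is the dominant cost in this sketch for long inputs.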
