On Clustering Images Using Compression

The need to cluster unknown data in order to better understand its relationship to known data is prevalent throughout science. One example is classifying species by genetic structure; another is grouping stars by temperature and size. Beyond yielding a better understanding of the data itself, or of a new unknown object, cluster analysis can help with data processing, data standardization, and outlier detection. These needs have given rise to a wide range of algorithms. Most clustering algorithms are based on known features or expectations, such as the popular partition-based, hierarchical, density-based, grid-based, and model-based algorithms. The choice of algorithm depends on many factors, including the type of data and the reason for clustering; nearly all rely on some known properties of the data being analyzed. Recently, Li et al. proposed a new "universal" similarity metric [4] that requires no prior knowledge about the objects being compared. Their metric is based on the Kolmogorov complexity of objects, that is, an object's minimal description, and measures similarity by considering how much information one object contains about the other. While the Kolmogorov complexity of an object is not computable, in "Clustering by Compression" Cilibrasi and Vitányi use common compression algorithms to approximate the universal similarity metric and cluster objects with high success [2]. They show that genomic sequences, music, and literature can be clustered accurately using a metric calculated from the compression of the objects. Unfortunately, clustering using compression does not trivially extend to higher dimensions. Informally, one can consider the dimension of an object to be the number of directions possible when describing it fully from a starting point.
For example, consider a novel: we say it is one-dimensional because there is a beginning, an end, and the words in between appear in a specific order in one direction. A story begins "Once upon a time," finishes with "The End," and is read in linear fashion. However, some data naturally manifests in more than one dimension, and is in fact difficult to express reasonably in fewer dimensions.
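As a concrete illustration of the compression-based approach, the normalized compression distance (NCD) of Cilibrasi and Vitányi can be sketched with an off-the-shelf compressor standing in for Kolmogorov complexity. The choice of bzip2 here is ours, not prescribed by the source; any real-world compressor could be substituted:

```python
import bz2

def clen(data: bytes) -> int:
    """Approximate an object's Kolmogorov complexity by its compressed length."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two near-identical texts compress well together, so their NCD is small;
# an unrelated byte pattern sits much farther away.
s1 = b"once upon a time there was a tale told in one dimension " * 20
s2 = b"once upon a time there was a tale told in one direction " * 20
s3 = bytes(range(256)) * 4
print(ncd(s1, s2), ncd(s1, s3))
```

Because the distance depends only on compressed lengths, the same few lines apply unchanged to DNA strings, MIDI files, or text, which is what makes the metric feature-free. Pairwise NCD values can then be fed to any standard clustering algorithm.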

[1] William I. Gasarch. Book review: An Introduction to Kolmogorov Complexity and Its Applications, Second Edition, 1997, by Ming Li and Paul Vitányi (Springer, Graduate Texts in Computer Science). SIGACT News, 1997.

[2] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 2003.

[3] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science, Springer, 2019.

[4] Bin Ma et al. The similarity metric. IEEE Transactions on Information Theory, 2001.