Text comparison using data compression

Similarity detection is central to spam detection, plagiarism detection, and topic detection. The main approach to comparing text documents is based on Kolmogorov complexity, which in theory provides an ideal measure of the similarity of two strings over a defined alphabet. Unfortunately, this measure is not computable, so several approximations must be defined. These approximations are not metrics in the strict sense, but under some circumstances they behave closely enough to one to be used in practice.

Summary (translated from Polish). The article discusses methods for recognizing text similarity. The principal approach is based on Kolmogorov complexity. Its main limitation is that its results are difficult to process numerically, so a number of approximations are proposed. (Text comparison using data compression)
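As an illustration of what a compression-based approximation can look like, the sketch below computes the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. This is an assumption made for illustration: the abstract does not state which approximations the authors define, and zlib is only a convenient stand-in for whatever compressor they use. The function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share most of their information;
    values near 1 suggest they share little. Because a real compressor
    is imperfect, the result is only an approximation and need not
    satisfy the metric axioms exactly.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog " * 20
    b = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))
```

Comparing a text with itself should give a clearly smaller distance than comparing it with unrelated text, although the self-distance is usually not exactly zero. That gap is one concrete way in which such approximations fall short of a true metric.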
