Authorship Attribution Using Relative Compression

Authorship attribution is a classical classification problem. We use it here to illustrate the performance of a compression-based measure built on the notion of relative compression. We compare this measure, the Normalized Relative Compression (NRC), with recent approaches based on multiple discriminant analysis and support vector machines, as well as with two other compression-based measures: the Normalized Conditional Compression Distance (a direct approximation of the Normalized Information Distance) and the popular Normalized Compression Distance. The NRC attained 100% correct classification on the data set used and showed consistency between compression ratio and classification performance, a characteristic not always present in other compression-based measures.
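To make the two kinds of measure concrete, the sketch below contrasts the Normalized Compression Distance with a relative-compression score. It is an illustration under stated assumptions, not the method evaluated here. The formulas are the ones commonly given in the literature, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) and NRC(x||y) = C(x||y) / (|x| log2 |A|), and neither is spelled out in this abstract. zlib and a frozen order-3 byte context model are stand-ins for whichever compressors the work actually used, and the names `ncd`, `relative_bits`, `nrc`, and `attribute`, as well as the parameters `k` and `alpha`, are hypothetical.

```python
import math
import zlib
from collections import Counter, defaultdict


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, approximated with zlib."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def relative_bits(x: bytes, y: bytes, k: int = 3, alpha: float = 1 / 16) -> float:
    """Bits needed to encode x using an order-k model built from y only.

    Assumption: the model is frozen after reading y, so x never helps
    to compress itself. alpha is an additive smoothing constant.
    """
    counts = defaultdict(Counter)
    for i in range(k, len(y)):
        counts[y[i - k:i]][y[i]] += 1

    bits = 8.0 * min(k, len(x))  # first k symbols sent uncompressed
    for i in range(k, len(x)):
        ctx = counts.get(x[i - k:i], Counter())
        total = sum(ctx.values())
        p = (ctx[x[i]] + alpha) / (total + 256 * alpha)
        bits -= math.log2(p)
    return bits


def nrc(x: bytes, y: bytes, k: int = 3) -> float:
    """Relative-compression score: bits for x given y, per 8-bit symbol of x."""
    return relative_bits(x, y, k) / (8 * len(x))


def attribute(text: bytes, references: dict) -> str:
    """Pick the reference whose model compresses the text best."""
    return min(references, key=lambda name: nrc(text, references[name]))
```

A call such as `attribute(unknown_text, {"author_a": corpus_a, "author_b": corpus_b})` would return the label whose reference text yields the lowest score. Unlike the NCD, the score is asymmetric: the target is encoded exclusively from the reference, which is the sense of "relative" assumed in this sketch.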