Comparing compression models for authorship attribution.

In this paper we compare different compression models for authorship attribution. To this end, our experiments consider three types of compressors, Lempel-Ziv (GZip), block sorting (BZip) and statistical (PPM), together with two similarity measures. We also analyze two attribution methods. Through a series of experiments on two databases, we show that all the compressors behave similarly, whereas the behavior of the similarity measures can vary considerably depending on the strategy used for authorship attribution. Our results corroborate the literature in the sense that compression models are a good alternative for authorship attribution, surpassing traditional pattern recognition systems based on classifiers and feature extraction.
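As an illustration of the general idea, the following minimal sketch attributes an unknown text to the author whose known writing it compresses best with. It rests on several assumptions that are not taken from the paper: the similarity measure is the normalized compression distance (NCD), chosen here only as one common compression-based measure; the attribution rule is a simple nearest-profile choice; the names `ncd`, `attribute` and `COMPRESSORS` are hypothetical; and only the Python standard-library `zlib` (the DEFLATE scheme used by GZip) and `bz2` compressors are used, since the standard library provides no PPM implementation.

```python
import bz2
import zlib
from typing import Callable, Dict

Compressor = Callable[[bytes], bytes]

# Assumption: stdlib stand-ins for two of the three compressor families.
COMPRESSORS: Dict[str, Compressor] = {
    "zlib": lambda data: zlib.compress(data, 9),  # Lempel-Ziv type (DEFLATE)
    "bz2": lambda data: bz2.compress(data, 9),    # block sorting type
}


def compressed_size(data: bytes, compress: Compressor) -> int:
    """Length in bytes of the compressed representation of data."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor) -> float:
    """Normalized compression distance; smaller means more similar.

    Assumption: NCD is used for illustration only and is not necessarily
    one of the two measures evaluated in the paper.
    """
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(unknown: bytes, profiles: Dict[str, bytes],
              compress: Compressor) -> str:
    """Return the author whose concatenated known texts are closest to unknown."""
    return min(profiles, key=lambda author: ncd(unknown, profiles[author], compress))


if __name__ == "__main__":
    # Toy data, invented for this example.
    profiles = {
        "author_a": b"the quick brown fox jumps over the lazy dog " * 20,
        "author_b": b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    unknown = b"the quick brown fox jumps over the lazy dog " * 5
    for name, compress in COMPRESSORS.items():
        print(name, attribute(unknown, profiles, compress))
```

Each author is represented here by one concatenated profile; an alternative strategy would compare the unknown text against every known document separately and take the nearest one, which may interact differently with the chosen similarity measure.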
