Compression-Based Algorithms for Deception Detection

In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. Methods are demonstrated to overcome the data size-dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.

[1]  Claire Cardie,et al.  Finding Deceptive Opinion Spam by Any Stretch of the Imagination , 2011, ACL.

[2]  Mathieu Bastian,et al.  Gephi: An Open Source Software for Exploring and Manipulating Networks , 2009, ICWSM.

[3]  Dina Mayzlin,et al.  Promotional Reviews: An Empirical Investigation of Online Review Manipulation , 2012 .

[4]  J. Pennebaker,et al.  Lying Words: Predicting Deception from Linguistic Styles , 2003, Personality & social psychology bulletin.

[5]  Claire Cardie,et al.  Estimating the prevalence of deception in online review communities , 2012, WWW.

[6]  Claire Cardie,et al.  Negative Deceptive Opinion Spam , 2013, NAACL.

[7]  B. Depaulo,et al.  Accuracy of Deception Judgments , 2006, Personality and social psychology review : an official journal of the Society for Personality and Social Psychology, Inc.

[8]  D. Biber,et al.  Longman Grammar of Spoken and Written English , 1999 .

[9]  Dongsong Zhang,et al.  A Statistical Language Modeling Approach to Online Deception Detection , 2008, IEEE Transactions on Knowledge and Data Engineering.

[10]  Ko Fujimura,et al.  Tweet classification by data compression , 2011, DETECT '11.

[11]  Rachel Greenstadt,et al.  Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity , 2012, TSEC.

[12]  Jay F. Nunamaker,et al.  Detecting Deception through Linguistic Analysis , 2003, ISI.

[13]  Hsinchun Chen,et al.  A framework for authorship identification of online messages: Writing-style features and classification techniques , 2006 .

[14]  Rachel Greenstadt,et al.  Practical Attacks Against Authorship Recognition Techniques , 2009, IAAI.

[15]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[16]  Rachel Greenstadt,et al.  Detecting Hoaxes, Frauds, and Deception in Writing Style Online , 2012, 2012 IEEE Symposium on Security and Privacy.

[17]  Ian H. Witten,et al.  Text categorization using compression models , 2000, Proceedings DCC 2000. Data Compression Conference.

[18]  Jeffrey T. Hancock,et al.  On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication , 2007 .

[19]  Cindy K. Chung,et al.  The development and psychometric properties of LIWC2007 , 2007 .

[20]  Geoffrey Leech,et al.  Grammatical word class variation within the British National Corpus sampler , 2002 .

[21]  Alistair Moffat,et al.  Implementing the PPM data compression scheme , 1990, IEEE Trans. Commun..

[22]  Ronald de Wolf,et al.  Algorithmic Clustering of Music Based on String Compression , 2004, Computer Music Journal.

[23]  James W. Pennebaker,et al.  Linguistic Inquiry and Word Count (LIWC2007) , 2007 .

[24]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[25]  Yejin Choi,et al.  Syntactic Stylometry for Deception Detection , 2012, ACL.

[26]  Jay F. Nunamaker,et al.  An exploratory study into deception detection in text-based computer-mediated communication , 2003, 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the.

[27]  Elad Yom-Tov,et al.  Serial Sharers: Detecting Split Identities of Web Authors , 2007, PAN.

[28]  Ning Wu,et al.  On Compression-Based Text Classification , 2005, ECIR.