Capturing Word Choice Patterns with LDA for Fake Review Detection in Sentiment Analysis

The usefulness of user-generated online reviews is hampered by fake reviews, often produced by clandestinely sponsored reviewers. Detecting fake reviews is difficult even for laypeople, and previous automatic detection approaches have likewise had only limited success. Earlier studies showed that people who tell lies or write deceptive reviews tend to choose words unnaturally. We propose a novel approach to detecting fake reviews that applies a topic modeling method based on Latent Dirichlet Allocation (LDA). A unique contribution of this paper is to explicate latent aspects of fake and truthful reviews by means of "topics" that are not necessarily subject areas but are related to word choice patterns reflecting the behavioral and linguistic characteristics of fake review writers. We constructed a labeled dataset based on Yelp and demonstrated that the proposed approach helps identify unique aspects of fake and truthful reviews, which has the potential to improve the performance of the fake review detection task. The experimental results show that our proposed method outperforms state-of-the-art methods on the small categories in our dataset.
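To make the idea of LDA "topics" as word choice patterns concrete, the following is a minimal sketch of generic LDA fitted by collapsed Gibbs sampling. It is not the pipeline used in the paper, whose details the abstract does not give. The function name `lda_gibbs`, the hyperparameter values, and the four toy reviews are all assumptions introduced for illustration and are not drawn from the Yelp dataset.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Generic LDA via collapsed Gibbs sampling (illustrative sketch only).

    docs: list of tokenized documents (lists of strings).
    Returns (doc_topic, topic_word, vocab), where each row of doc_topic
    and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * num_topics for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(num_topics)]   # topic-word counts
    nk = [0] * num_topics                        # tokens assigned to each topic
    z = []                                       # topic assignment of every token

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(zd)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1

    # Smoothed estimates of the per-document and per-topic distributions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical toy reviews, invented for illustration only.
toy_reviews = [
    "amazing amazing best place ever my husband loved it".split(),
    "best experience ever my family loved everything amazing".split(),
    "the soup was lukewarm and the wait was forty minutes".split(),
    "parking was tight but the noodles were decent for the price".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(toy_reviews, num_topics=2)
```

Each row of `doc_topic` is a review's mixture over topics. One plausible use, again an assumption rather than the paper's stated design, is to pass these vectors as features to a downstream classifier that separates fake from truthful reviews.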
