Constructing and Evaluating a Novel Crowdsourcing-based Paraphrased Opinion Spam Dataset

Opinion spam, written intentionally by spammers who have no actual experience with the services or products they review, has become a factor that undermines the credibility of online information. In recent years, studies have attempted to detect opinion spam using machine learning algorithms. However, the limitations of gold-standard spam datasets remain a major obstacle in opinion spam research. In this paper, we introduce a novel dataset called Paraphrased OPinion Spam (POPS), which contains a new type of review spam, created through crowdsourcing, that imitates real human opinions. To create such seemingly truthful review spam, we asked task participants to paraphrase truthful reviews and to include factual information and domain knowledge in their reviews. Classification experiments and semantic analysis show that the POPS dataset most closely resembles truthful reviews, both linguistically and semantically. We believe that our new deceptive opinion spam dataset will help advance opinion spam research.
