Collective Opinion Spam Detection using Active Inference

Opinion spam has become a widespread problem in the online review world, where paid or biased reviewers write fake reviews to elevate or relegate a product (or business) and mislead consumers for profit or fame. In recent years, opinion spam detection has attracted considerable attention from both the business and research communities. However, the problem remains challenging: human labeling is expensive, and hence the labeled data needed for supervised learning and evaluation is scarce. Recent works (e.g., FraudEagle [24], SpEagle [16]) address spam detection as an unsupervised network inference task on the review network. These methods can also incorporate labels when available, and have been shown to achieve improved performance in the semi-supervised inference setting, in which the labels of a random sample of nodes are used. In this work, we address the problem of active inference for opinion spam detection. Active inference is the process of carefully selecting a subset of instances (nodes) whose labels are obtained from an oracle and used during the (network) inference. Our goal is to employ a label acquisition strategy that selects a given number of nodes (a.k.a. the budget) wisely, as opposed to randomly, so as to improve detection performance significantly over random selection. Our key insight is to select nodes that (i) exhibit high uncertainty, (ii) reside in a dense region of the network, and (iii) are close to other uncertain nodes. Based on this insight, we design a utility measure, called Expected UnCertainty Reach (EUCR), and iteratively pick the node with the highest EUCR score at every step. Experiments on two large real-world datasets from Yelp.com show that our method significantly outperforms random sampling as well as other state-of-the-art active inference approaches.
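The abstract outlines a greedy acquisition loop: score every unlabeled node, query the oracle for the label of the top-scoring one, fold the new label back into the network inference, and repeat until the budget is spent. Below is a minimal sketch of that loop in Python (using networkx). The scoring combination in `eucr_score`, the two-hop `radius`, and the uncertainty re-estimation step are illustrative assumptions standing in for the paper's actual EUCR definition, not the authors' implementation.

```python
import networkx as nx

def eucr_score(G, v, uncertainty, radius=2):
    """Toy stand-in for the paper's EUCR utility (the exact formula is
    defined in the paper; this combination is an assumption): favor nodes
    that are themselves uncertain, sit in dense neighborhoods (high degree),
    and can reach other uncertain nodes within a few hops."""
    reach = nx.single_source_shortest_path_length(G, v, cutoff=radius)
    reached_uncertainty = sum(uncertainty[u] for u in reach if u != v)
    return uncertainty[v] * G.degree(v) * (1.0 + reached_uncertainty)

def select_nodes(G, uncertainty, budget):
    """Greedy loop from the abstract: at each step, pick the unlabeled node
    with the highest utility score and query the oracle for its label."""
    labeled = []
    candidates = set(G.nodes())
    for _ in range(budget):
        best = max(candidates, key=lambda v: eucr_score(G, v, uncertainty))
        labeled.append(best)
        candidates.remove(best)
        # In the real pipeline the new label would be fed back into the
        # network inference (e.g., SpEagle [16]) and all node uncertainties
        # re-estimated; zeroing the queried node is a crude placeholder.
        uncertainty[best] = 0.0
    return labeled

# Toy usage: a small graph with uniform initial uncertainty estimates.
if __name__ == "__main__":
    G = nx.karate_club_graph()
    unc = {v: 0.5 for v in G}  # stand-in for inference-derived uncertainties
    print(select_nodes(G, unc, budget=5))
```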

[1] Yejin Choi et al. Distributional Footprints of Deceptive Product Reviews, 2012, ICWSM.

[2] Manali Sharma et al. Most-Surely vs. Least-Surely Uncertain, 2013, IEEE International Conference on Data Mining (ICDM).

[3] David Cohn et al. Active Learning, 2010, Encyclopedia of Machine Learning.

[4] Lise Getoor et al. Effective label acquisition for collective classification, 2008, KDD.

[5] Bin Wu et al. Exploiting Network Structure for Active Inference in Collective Classification, 2007, IEEE International Conference on Data Mining Workshops (ICDMW).

[6] William A. Gale et al. A sequential algorithm for training text classifiers, 1994, SIGIR '94.

[7] Lise Getoor et al. Active Learning for Networked Data, 2010, ICML.

[8] Arjun Mukherjee et al. What Yelp Fake Review Filter Might Be Doing?, 2013, ICWSM.

[9] Kamal Nigam et al. Employing EM and Pool-Based Active Learning for Text Classification, 1998, ICML.

[10] P. V. Marsden et al. Homogeneity in confiding relations, 1988.

[11] Philip S. Yu et al. Review Graph Based Online Store Review Spammer Detection, 2011, IEEE International Conference on Data Mining (ICDM).

[12] Leman Akoglu et al. Discovering Opinion Spammer Groups by Network Footprints, 2015, ECML/PKDD.

[13] J. Laurie Snell et al. Markov Random Fields and Their Applications, 1980.

[14] Philip S. Yu et al. Review spam detection via temporal pattern discovery, 2012, KDD.

[15] Bing Liu et al. Opinion spam and analysis, 2008, WSDM '08.

[16] Leman Akoglu et al. Collective Opinion Spam Detection: Bridging Review Networks and Metadata, 2015, KDD.

[17] Arjun Mukherjee et al. Spotting fake reviewer groups in consumer reviews, 2012, WWW.

[18] David D. Lewis et al. Heterogeneous Uncertainty Sampling for Supervised Learning, 1994, ICML.

[19] Claire Cardie et al. Finding Deceptive Opinion Spam by Any Stretch of the Imagination, 2011, ACL.

[20] Michael Luca. Reviews, Reputation, and Revenue: The Case of Yelp.Com, 2016.

[21] Sofus A. Macskassy. Using graph-based metrics with empirical risk minimization to speed up active learning on networked data, 2009, KDD.

[22] Ee-Peng Lim et al. Finding unusual review patterns using unexpected rules, 2010, CIKM.

[23] Philip S. Yu et al. Active Learning: A Survey, 2014, Data Classification: Algorithms and Applications.

[24] Christos Faloutsos et al. Opinion Fraud Detection in Online Reviews by Network Effects, 2013, ICWSM.

[25] Peter W. M. Blayney. Reviews, 2014.