Toward a Language Modeling Approach for Consumer Review Spam Detection

Numerous reports have indicated the severity of fake reviews (i.e., spam) posted to various e-Commerce and opinion-sharing Web sites. Nevertheless, very few studies have examined the trustworthiness of online consumer reviews, because an effective computational methodology has been lacking. Unlike other kinds of Web spam, untruthful reviews can look just like legitimate reviews (i.e., ham), so it is difficult to identify features that distinguish the two classes. One main contribution of our research is the development of a novel computational methodology to combat online review spam. Our experimental results confirm that a computational model based on KL divergence and probabilistic language modeling is effective for the detection of untruthful reviews. Empowered by the proposed computational methods, our empirical study found that around 2% of the consumer reviews posted to a large e-Commerce site are spam.
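As a rough illustration of how KL divergence between language models can compare two reviews, the sketch below builds smoothed unigram models over a shared vocabulary and computes a symmetric divergence. The abstract does not specify the model, so everything here is an assumption made for illustration: the whitespace tokenization, the additive smoothing parameter `alpha`, the symmetric form of the divergence, and the function names. It is not the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram language model over a fixed vocabulary.

    Assumption: lowercase whitespace tokenization and add-alpha smoothing.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def review_divergence(review_a, review_b, alpha=1.0):
    """Symmetric KL divergence between the unigram models of two reviews.

    Lower values mean the two reviews use more similar wording.
    """
    vocab = set(review_a.lower().split()) | set(review_b.lower().split())
    p = unigram_model(review_a, vocab, alpha)
    q = unigram_model(review_b, vocab, alpha)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    a = "great camera with sharp pictures and long battery life"
    b = "great camera with sharp pictures and long battery"
    c = "the hotel room was dirty and the staff were rude"
    print("near-duplicate pair:", review_divergence(a, b))
    print("unrelated pair:", review_divergence(a, c))
```

Under these assumptions, a pair of reviews with an unusually low divergence could be flagged for closer inspection as a possible near-duplicate. Any threshold would have to be tuned on labeled data.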
