A method for sorting out the spam from Chinese product reviews

This paper studies spam detection in Chinese product reviews. For useless reviews, it characterizes each review with four important classification features, based on questions, hyperlinks, and similar cues, and then applies a logistic regression classifier to detect them. For untruthful reviews, it first proposes a 2-gram model that characterizes reviews while taking word order into account, then smooths the model with the Katz smoothing method, and finally uses KL divergence to detect the untruthful reviews. Experiments show that the proposed methods can effectively detect spam in Chinese product reviews.
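As a rough illustration of the 2-gram and KL divergence step, the sketch below compares the bigram distribution of one review against a reference text. It is not the paper's implementation: it assumes the reviews are already segmented into word tokens, it swaps in simple additive (add-k) smoothing in place of Katz smoothing to stay short, and the function names (`bigrams`, `bigram_distribution`, `review_divergence`) are hypothetical. How the divergence is turned into a spam decision is not stated in the abstract, so no threshold is shown.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Return the word bigrams of a token list, with boundary markers."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))

def bigram_distribution(tokens, pairs, k=1.0):
    """Add-k smoothed distribution over a fixed set of bigrams.

    Assumption: add-k smoothing stands in for Katz smoothing here.
    """
    counts = Counter(bigrams(tokens))
    total = sum(counts[p] for p in pairs) + k * len(pairs)
    return {p: (counts[p] + k) / total for p in pairs}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def review_divergence(review_tokens, reference_tokens, k=1.0):
    """KL divergence between the bigram models of a review and a reference."""
    pairs = sorted(set(bigrams(review_tokens)) | set(bigrams(reference_tokens)))
    p = bigram_distribution(review_tokens, pairs, k)
    q = bigram_distribution(reference_tokens, pairs, k)
    return kl_divergence(p, q)

# Hypothetical, pre-segmented reviews.
same = review_divergence(["质量", "很", "好"], ["质量", "很", "好"])
reordered = review_divergence(["质量", "很", "好"], ["好", "很", "质量"])
```

Because the model is built on bigrams rather than single words, the reordered review yields a positive divergence even though it uses exactly the same words, which is the effect of taking word order into account.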
