Positive Unlabeled Learning for Deceptive Reviews Detection

Deceptive review detection has attracted significant attention from both the business and research communities. However, because of the difficulty of obtaining the human-labeled data needed for supervised learning, the problem remains highly challenging. This paper proposes a novel angle on the problem by modeling it as PU (positive-unlabeled) learning. A semi-supervised model, called mixing population and individual property PU learning (MPIPUL), is proposed. First, reliable negative examples are identified from the unlabeled dataset. Second, representative positive and negative examples are generated based on LDA (Latent Dirichlet Allocation). Third, each remaining unlabeled example (called a spy example), which cannot be explicitly identified as positive or negative, is assigned two similarity weights that express the probabilities of it belonging to the positive class and the negative class. Finally, the spy examples and their similarity weights are incorporated into an SVM (Support Vector Machine) to build an accurate classifier. Experiments on a gold-standard dataset demonstrate the effectiveness of MPIPUL, which outperforms state-of-the-art baselines.
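To make the four-stage pipeline concrete, below is a minimal sketch of a PU-learning workflow in the spirit of MPIPUL, built with scikit-learn. The abstract does not specify how reliable negatives are extracted, how class representatives are formed, or how the weights enter the SVM, so the following are labeled assumptions: a Naive Bayes scorer for the reliable-negative step, mean LDA topic vectors as class representatives, cosine similarity for the spy weights, and duplicated spy rows with per-class sample weights for the weighted SVM. Function names, thresholds, and topic counts are illustrative, not the paper's settings; inputs are assumed to be dense term-count matrices.

```python
# Sketch of an MPIPUL-style PU-learning pipeline (assumptions noted above).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def mpipul_sketch(X_pos, X_unlab, n_topics=20, neg_quantile=0.1):
    """X_pos, X_unlab: dense nonnegative term-count matrices (np.ndarray)."""
    # Step 1 (assumed variant): score unlabeled examples with a positive-vs-
    # unlabeled Naive Bayes model; treat the lowest-scoring ones as reliable
    # negatives and the rest as ambiguous "spy" examples.
    X = np.vstack([X_pos, X_unlab])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unlab))]
    p_pos = MultinomialNB().fit(X, y).predict_proba(X_unlab)[:, 1]
    threshold = np.quantile(p_pos, neg_quantile)
    reliable_neg = X_unlab[p_pos <= threshold]
    spies = X_unlab[p_pos > threshold]

    # Step 2 (assumed variant): fit LDA and use the mean topic vectors of the
    # positive and reliable-negative sets as class representatives.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(np.vstack([X_pos, reliable_neg, spies]))
    rep_pos = lda.transform(X_pos).mean(axis=0, keepdims=True)
    rep_neg = lda.transform(reliable_neg).mean(axis=0, keepdims=True)

    # Step 3: give each spy two similarity weights, one per class, measuring
    # how strongly it resembles each class representative in topic space.
    T_spy = lda.transform(spies)
    w_pos = cosine_similarity(T_spy, rep_pos).ravel()
    w_neg = cosine_similarity(T_spy, rep_neg).ravel()

    # Step 4 (assumed variant): train an SVM in which each spy appears twice,
    # once per class, with its normalized similarity as the sample weight.
    X_train = np.vstack([X_pos, reliable_neg, spies, spies])
    y_train = np.r_[np.ones(len(X_pos)), np.zeros(len(reliable_neg)),
                    np.ones(len(spies)), np.zeros(len(spies))]
    norm = w_pos + w_neg + 1e-12
    weights = np.r_[np.ones(len(X_pos)), np.ones(len(reliable_neg)),
                    w_pos / norm, w_neg / norm]
    svm = SVC(kernel="linear")
    svm.fit(X_train, y_train, sample_weight=weights)
    return svm
```

The per-class sample weights let the SVM use the ambiguous spy examples softly instead of forcing a hard positive/negative assignment, which is the core idea the abstract describes.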
