Early Detection of Promotion Campaigns in Community Question Answering

As with many social media websites, Community Question Answering (CQA) portals have become a target for spammers who disseminate promotional information. Previous work mainly focuses on identifying low-quality answers or detecting spam in question-answer (QA) pairs. These approaches suffer from long delays, because they all rely on information about answers or answerers, which becomes available only after a question has been displayed on the website for some time and has already attracted user traffic. In fact, spammers on CQA platforms also act as questioners and embed promotional information in their questions. If such questions are detected early enough, they never appear on the website and never reach legitimate users. In this paper, we design a framework for early detection of promotion campaigns in CQA that uses only question information and the questioner's profile. First, we propose a novel sampling method for identifying questions that contain promotional information; these questions form the positive dataset. We also sample an unlabeled dataset of unsolved questions posted during a certain period of time. Then, we compare the characteristics of question information and user profiles between the two datasets, and use these characteristics as features in the learning process. Finally, we apply and compare several PU (positive and unlabeled examples) learning algorithms to find positive examples in the unlabeled dataset. Because our approach needs no answer-side information, it can detect spamming activity as soon as a question is posted. Experimental results on about 0.7 million questions from a popular Chinese CQA portal indicate that our approach detects questions related to promotion campaigns as effectively as, but more efficiently than, state-of-the-art QA-pair-level detection methods.
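To make the PU learning setting concrete, the following is a minimal sketch of a generic two-step PU heuristic built on nearest-centroid classification. It is not one of the algorithms evaluated in the paper, which the abstract does not name. The two numeric features, the synthetic data, and all function names (`two_step_pu`, `make_rows`, and so on) are hypothetical and chosen only for illustration; the features stand in for whatever question and questioner characteristics are actually used.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closer_to_first(x, c_first, c_second):
    """True if x is strictly closer to c_first than to c_second."""
    return sq_dist(x, c_first) < sq_dist(x, c_second)


def two_step_pu(positive, unlabeled):
    """Return one boolean per unlabeled row: True if flagged as promotion-like."""
    c_pos = centroid(positive)
    # Step 1: treat every unlabeled row as a provisional negative, then keep
    # only the rows that still fall on the unlabeled side ("reliable negatives").
    c_unl = centroid(unlabeled)
    reliable_neg = [x for x in unlabeled if closer_to_first(x, c_unl, c_pos)]
    # Step 2: rebuild the negative centroid from the reliable negatives only
    # (falling back to the unlabeled centroid if none were found) and
    # re-classify the whole unlabeled set.
    c_neg = centroid(reliable_neg) if reliable_neg else c_unl
    return [closer_to_first(x, c_pos, c_neg) for x in unlabeled]


def make_rows(rng, center, n):
    """Synthetic 2-feature rows around `center` (hypothetical feature values)."""
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(n)]


# Assumed toy data: promotion-like questions cluster near (3, 3),
# ordinary questions near (0, 0). Real features would come from the
# question text and the questioner profile.
rng = random.Random(0)
positive = make_rows(rng, 3.0, 100)          # labeled promotion questions
hidden_pos = make_rows(rng, 3.0, 40)         # promotion questions hidden in U
normal = make_rows(rng, 0.0, 160)            # ordinary questions in U
unlabeled = hidden_pos + normal
truth = [True] * len(hidden_pos) + [False] * len(normal)

flags = two_step_pu(positive, unlabeled)
tp = sum(1 for f, t in zip(flags, truth) if f and t)
fp = sum(1 for f, t in zip(flags, truth) if f and not t)
recall = tp / len(hidden_pos)
precision = tp / max(tp + fp, 1)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The sketch needs only features that are available at posting time, which matches the early-detection goal: nothing in it depends on answers or answerers. In practice the nearest-centroid classifier would likely be replaced by a stronger learner, and the toy clusters here are far better separated than real data would be.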
