A context-aware approach to detection of short irrelevant texts

This paper presents a simple and effective framework for detecting irrelevant short texts, such as comments following blog posts and news articles, in a context-aware and timely fashion. Websites such as LinkedIn.com and CNN.com allow visitors to leave comments after articles, and spammers exploit this feature to post irrelevant content. Because these websites are visited by millions of readers per day and have extremely high visibility, irrelevant comments have a detrimental effect on their traffic and revenue. It is therefore critical to eliminate irrelevant comments as accurately and as early as possible. Unlike the documents handled in traditional text mining tasks, comments following news and blog articles are brief and have context-dependent semantics, which makes semantic relevance difficult to measure. Worse, there may be only a handful of comments soon after an article is posted, leaving very little information from which to infer semantics and measure relevance. We propose to infer “context-aware semantics” to address these challenges in a unified framework. Specifically, we construct a context for each comment using either blocks of surrounding comments or comments collected via a principled transfer learning approach. The constructed contexts mitigate sparseness and sharply define the context-dependent semantics of comments, even at the early stage of commenting activity, allowing traditional dimension reduction methods to better capture the semantics of short texts in a context-aware way. We confirm the effectiveness of the proposed method on two real-world datasets consisting of news and blog articles and their comments, with a maximal improvement of 20% in Area Under the Precision-Recall Curve.
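The idea of scoring a comment inside a constructed context can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the context is simply the article plus the block of comments posted under it, uses a truncated SVD (in the style of latent semantic indexing) as a stand-in for the "traditional dimension reduction methods" mentioned above, and scores each comment by its cosine similarity to the article in the reduced space. The function names `term_doc_matrix` and `relevance_scores`, the choice `k=2`, and the toy texts are all hypothetical.

```python
import numpy as np

def term_doc_matrix(docs):
    """Return (vocab, X), where X[i, j] counts term i in document j."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            X[index[t], j] += 1.0
    return vocab, X

def relevance_scores(article, comments, k=2):
    """Cosine similarity of each comment to the article in a k-dim latent space.

    Assumption: the context is the article plus all comments under it.
    """
    docs = [article] + list(comments)
    _, X = term_doc_matrix(docs)
    # Left singular vectors span the latent "topic" directions of this context.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    k = min(k, U.shape[1])
    Z = X.T @ U[:, :k]                 # one k-dim row per document
    norms = np.linalg.norm(Z, axis=1)
    norms[norms < 1e-12] = np.inf      # documents with no latent mass score 0
    Z = Z / norms[:, None]
    return Z[1:] @ Z[0]                # comments vs. article

# Toy data, invented for illustration only.
article = "the election results show the senate race is close"
comments = [
    "the senate race was closer than the polls said",
    "election results are always close in this state",
    "buy cheap watches online now",
    "cheap watches and handbags buy online",
]
for text, score in zip(comments, relevance_scores(article, comments)):
    print(f"{score:.2f}  {text}")
```

In this toy context the two on-topic comments score close to 1 and the two advertising comments score close to 0. Note that the second on-topic comment shares no words with the first, so it is the shared context (here, the article) that ties them together in the latent space.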
