Content-based trust and bias classification via biclustering

In this paper we improve trust, bias and factuality classification of Web data at the domain level. Most of the literature in this area extracts opinion from short text at the micro level. We instead aim to help a researcher or an archivist obtain a large collection that, at a high level, originates from unbiased and trustworthy sources. Our method generates features as Jensen-Shannon distances from the cluster centers of a host-term biclustering. On top of these distance features we apply kernel methods, and we also combine them with baseline text classifiers. We test our method on DC2010, the data set of the ECML/PKDD Discovery Challenge 2010. Our method improves on the best text classification NDCG results by 3 to 10% for neutrality, bias and trustworthiness. The task is hard: the ECML/PKDD Discovery Challenge 2010 participants reached an AUC only slightly above 0.5.
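
The distance features described above can be illustrated with a minimal sketch. It assumes that each host is represented by a vector of term counts and that a biclustering step (not shown here) has already assigned every host to a cluster. The toy counts, the cluster assignment and all function names are hypothetical and are not taken from the paper; the sketch only shows how a Jensen-Shannon distance to each cluster center could be computed, using base-2 logarithms so that distances lie in [0, 1].

```python
import math


def normalize(counts):
    """Turn a vector of term counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an all-zero term vector")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Base-2 KL divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_div, 0.0))  # guard against tiny negative rounding


def cluster_centers(host_term_counts, host_cluster, n_clusters):
    """Center of a cluster = normalized sum of its hosts' term counts."""
    n_terms = len(host_term_counts[0])
    sums = [[0.0] * n_terms for _ in range(n_clusters)]
    for counts, cluster in zip(host_term_counts, host_cluster):
        for j, value in enumerate(counts):
            sums[cluster][j] += value
    return [normalize(s) for s in sums]


def distance_features(host_term_counts, centers):
    """One feature per cluster center: the JS distance from host to center."""
    return [
        [js_distance(normalize(counts), center) for center in centers]
        for counts in host_term_counts
    ]


# Hypothetical toy data: 4 hosts, 4 terms, 2 host clusters (assumed given).
host_term_counts = [
    [8, 2, 0, 0],
    [6, 4, 0, 0],
    [0, 0, 5, 5],
    [0, 1, 3, 6],
]
host_cluster = [0, 0, 1, 1]

centers = cluster_centers(host_term_counts, host_cluster, n_clusters=2)
features = distance_features(host_term_counts, centers)

for host_id, row in enumerate(features):
    print(host_id, [round(d, 3) for d in row])
```

Each row of `features` could then be passed to a kernel classifier or concatenated with baseline text features. Which kernel to use and how to combine the models are left open here, since the abstract does not specify them.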
