Using evidence based content trust model for spam detection

Content trust is one of the main topics in information retrieval research. As it becomes easier to add information to the Web through HTML pages, wikis, blogs, and other documents, it becomes harder to distinguish accurate, trustworthy information from inaccurate or untrustworthy information. Current spam detection technology relies on a binary metric; that is, it treats spam detection as a binary classification problem. To meet users' needs and preferences, a more accurate metric is needed both for content trust and for detecting spam. In this paper, we apply the notion of content trust to spam detection and treat it as a ranking problem. In addition to traditional text features, we introduce evidence based on information quality to define the trust features of spam information, and we propose a novel content trust learning algorithm based on this evidence. Finally, we develop a Web spam detection system and carry out experiments on real Web data, which show that the proposed method performs very well in practice.
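To make the contrast between binary classification and ranking concrete, the following is a minimal sketch of a pairwise ranker over page features. It is not the learning algorithm proposed in the paper, whose details the abstract does not give; it swaps in a simple margin-based pairwise perceptron. The feature layout, the three trust levels, the page names, and all numeric values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical feature layout (an assumption, not taken from the paper):
# [keyword_density, visible_text_fraction, has_author_info, freshness]
# The first two stand in for text features, the last two for
# information-quality evidence.
TRAINING_PAGES = [
    ([0.05, 0.80, 1.0, 0.9], 2),  # trusted
    ([0.10, 0.70, 1.0, 0.4], 2),  # trusted
    ([0.30, 0.50, 0.0, 0.4], 1),  # borderline
    ([0.25, 0.55, 1.0, 0.1], 1),  # borderline
    ([0.80, 0.20, 0.0, 0.1], 0),  # spam
    ([0.90, 0.10, 0.0, 0.0], 0),  # spam
]


def score(weights, features):
    """Linear trust score: higher means more trustworthy."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(pages, epochs=50, learning_rate=0.1):
    """Learn weights so that higher-trust pages outscore lower-trust ones."""
    weights = [0.0] * len(pages[0][0])
    for _ in range(epochs):
        for (feat_a, trust_a), (feat_b, trust_b) in product(pages, repeat=2):
            if trust_a <= trust_b:
                continue  # only pairs where page a should rank above page b
            margin = score(weights, feat_a) - score(weights, feat_b)
            if margin < 1.0:
                # Margin violated: move the weights toward (a - b).
                weights = [
                    w + learning_rate * (xa - xb)
                    for w, xa, xb in zip(weights, feat_a, feat_b)
                ]
    return weights


def rank_pages(weights, pages_by_name):
    """Return page names ordered from most to least trusted."""
    return sorted(
        pages_by_name,
        key=lambda name: score(weights, pages_by_name[name]),
        reverse=True,
    )


weights = train_pairwise_ranker(TRAINING_PAGES)

# Unseen, hypothetical pages to rank.
NEW_PAGES = {
    "page_b": [0.35, 0.45, 0.0, 0.3],
    "page_c": [0.85, 0.15, 0.0, 0.05],
    "page_a": [0.08, 0.75, 1.0, 0.8],
}
ranked = rank_pages(weights, NEW_PAGES)
print(ranked)
```

The output is an ordering of pages by trust score rather than a spam/non-spam label, so a downstream system can pick its own cut-off or show graded results to the user.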
