Semi-supervised learning: a comparative study for web spam and telephone user churn

We compare a wide range of semi-supervised learning techniques for both Web spam filtering and telephone user churn classification. Semi-supervised learning rests on the assumption that the label of a node in a graph is similar to the labels of its neighbors. In this paper we measure this phenomenon for both Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly, churn occurs in bursts within groups of a social network.
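One simple way to quantify the neighbor-similarity assumption is to count, for each class, how often a labeled node links to another node with the same label. The sketch below is only an illustration and not the measurement procedure used in the paper: the function name `neighbor_label_agreement`, the edge-list representation, and the toy graph are all hypothetical.

```python
from collections import defaultdict

def neighbor_label_agreement(edges, labels):
    """For each label, return the fraction of links between labeled nodes
    whose source and target share that label (hypothetical helper)."""
    same = defaultdict(int)
    total = defaultdict(int)
    for src, dst in edges:
        # Skip links that touch an unlabeled node.
        if src not in labels or dst not in labels:
            continue
        total[labels[src]] += 1
        if labels[src] == labels[dst]:
            same[labels[src]] += 1
    return {label: same[label] / total[label] for label in total}

# Toy, made-up graph: two spam pages, three honest pages, one unlabeled page.
edges = [
    ("s1", "s2"), ("s2", "s1"), ("s1", "h1"),
    ("h1", "h2"), ("h2", "h3"), ("h3", "h1"),
    ("s2", "u1"),
]
labels = {"s1": "spam", "s2": "spam",
          "h1": "honest", "h2": "honest", "h3": "honest"}

agreement = neighbor_label_agreement(edges, labels)
print(agreement)
```

Agreement values close to 1 for a class would suggest that its label can plausibly be propagated along links; the same count applies to a call graph with churned and retained users in place of spam and honest pages.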
