The Classification Power of Web Features

Abstract In this article we give a comprehensive overview of features devised for web spam detection and investigate how much the various feature classes, some of which require very high computational effort, add to classification accuracy. We collect and handle a large number of features based on recent advances in web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change. We propose new temporal, link-similarity-based features and show how to compute them efficiently on large graphs. We show that machine learning techniques, including ensemble selection, LogitBoost, and random forest, significantly improve accuracy. We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all previously published results on our dataset and can be improved only slightly by computationally expensive features. We test our method on three major publicly available datasets: the Web Spam Challenge 2008 dataset WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge dataset DC2010, and the Waterloo Spam Rankings for ClueWeb09. Our classifier ensemble sets the strongest classification benchmark compared to participants of the Web Spam and ECML/PKDD Discovery Challenges as well as the TREC Web track. To foster research in the area, we make several feature sets and source code public at https://datamining.sztaki.hu/en/download/web-spam-resources, including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007 as well as the Web Spam Challenge features for the labeled part of ClueWeb09.
