Paper information - Exploring linguistic features for web spam detection: a preliminary study

We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007; we make them publicly available to other researchers. Preliminary analysis suggests that certain linguistic features may be useful for spam detection when combined with features studied elsewhere.
