Paper Information - Spam Detection Using Character N-Grams

Spam Detection Using Character N-Grams

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both the binary and the term-frequency representations achieve high precision while maintaining recall at an equally high level, which is crucial for anti-spam filters, a cost-sensitive application.
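The 'bag of character n-grams' idea, with its binary and term-frequency variants, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the n-gram length `n=3`, the function names, and the use of raw counts as the term-frequency value are all assumptions made for illustration, and the support vector machine step is left out.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text` (n=3 is an assumed default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def term_frequency_vector(text, vocabulary, n=3):
    """Count how often each vocabulary n-gram occurs in `text` (raw counts, an assumption)."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


def binary_vector(text, vocabulary, n=3):
    """Mark each vocabulary n-gram as present (1) or absent (0) in `text`."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical toy messages; a real experiment would use a corpus such as Ling-Spam.
messages = ["free money", "free free"]
vocabulary = sorted({gram for m in messages for gram in char_ngrams(m, 3)})

tf_vectors = [term_frequency_vector(m, vocabulary) for m in messages]
binary_vectors = [binary_vector(m, vocabulary) for m in messages]
```

Because the features are plain substrings, no tokenizer or lemmatizer is involved, which is what makes the representation language-independent. The resulting vectors could then be passed to any classifier, such as a support vector machine.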

Efstathios Stamatatos | Ioannis Kanaris | Konstantinos Kanaris
