Word sense disambiguation for spam filtering

Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms, and phishing. More than 86% of received e-mails are spam. Historical approaches to combating these messages, including simple techniques such as sender blacklisting or the use of e-mail signatures, are no longer completely reliable. Many current solutions feature machine-learning algorithms trained using statistical representations of the terms that most commonly appear in such e-mails. However, these methods are merely syntactic and are unable to account for the underlying semantics of terms within messages. In this paper, we explore the use of semantics in spam filtering by introducing a pre-processing step of Word Sense Disambiguation (WSD). Based upon this disambiguated representation, we apply several well-known machine-learning models and show that the proposed method can detect the internal semantics of spam messages.
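As an illustration only, the WSD pre-processing step could be sketched as follows. This is not the disambiguation method used in the paper: it is a simplified gloss-overlap heuristic in the spirit of the Lesk algorithm, and the sense inventory `SENSES`, the sense labels such as `bank#finance`, and the function names are all hypothetical. Each ambiguous token is replaced by a sense label, so that a downstream classifier would count senses rather than surface terms.

```python
from collections import Counter

# Hypothetical toy sense inventory: word -> {sense label: gloss words}.
# A real system would draw senses from a lexical database instead.
SENSES = {
    "bank": {
        "bank#finance": {"money", "account", "loan", "deposit", "transfer"},
        "bank#river": {"river", "water", "shore", "slope"},
    },
}


def disambiguate(tokens):
    """Replace each ambiguous token with the sense whose gloss
    shares the most words with the rest of the message."""
    result = []
    for i, token in enumerate(tokens):
        senses = SENSES.get(token)
        if not senses:
            # Token is not in the inventory: keep it unchanged.
            result.append(token)
            continue
        context = set(tokens[:i] + tokens[i + 1:])
        # Sorting the labels makes tie-breaking deterministic.
        best = max(sorted(senses), key=lambda s: len(senses[s] & context))
        result.append(best)
    return result


def sense_features(message):
    """Build a bag-of-senses count vector for one message."""
    tokens = message.lower().split()
    return Counter(disambiguate(tokens))


spam_like = sense_features("verify your bank account to receive the money transfer")
other = sense_features("we walked along the river bank")
```

Under these assumptions, the two messages yield different features for the same surface word "bank", which is the distinction a purely term-based representation cannot make.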
