MML inference of Finite State Automata for probabilistic spam detection

Minimum Message Length (MML) has emerged as a powerful tool for the inductive inference of discrete, continuous and hybrid structures. The Probabilistic Finite State Automaton (PFSA) is one such discrete structure that must be inferred for a range of problems in computer science, including artificial intelligence, pattern recognition and data mining. MML has also proved a viable tool for many machine learning problems, both supervised and unsupervised, of which classification is the most common. This research offers a two-part solution: the first part focuses on inferring the best PFSA using MML, and the second part applies that PFSA to the classification problem of spam detection. Using the best PFSA inferred in the first part, the spam detection approach was tested with MML on the publicly available Enron Spam dataset. The filter was evaluated on standard performance measures such as precision and recall, and the cost of misclassification was also taken into account through the weighted accuracy rate and weighted error rate. Our empirical evaluation indicates a classification accuracy of around 93%, which outperforms well-known established spam filters.
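To make the approach concrete, the sketch below illustrates in Python how a PFSA-based filter of this kind might classify a message: each class (spam and ham) gets its own automaton, a message is scored by its code length (negative log-probability) under each model, and the class giving the shorter encoding wins. The automaton structure, the Laplace smoothing, the helper names (PFSA, classify, weighted_accuracy) and the cost ratio lambda = 9 are illustrative assumptions, not the paper's MML-inferred models or its exact evaluation settings.

```python
# Illustrative sketch only: score a message by its code length under a spam
# PFSA and a ham PFSA, and assign it to the class that compresses it better.
import math
from collections import defaultdict

class PFSA:
    def __init__(self):
        # Transition counts: state -> symbol -> count.
        self.counts = defaultdict(lambda: defaultdict(int))
        self.delta = {}  # (state, symbol) -> next state

    def train(self, sequences):
        # Build a simple bigram-style automaton in which each symbol is also
        # the state it leads to (an assumption standing in for the paper's
        # MML-inferred structure).
        for seq in sequences:
            state = "<start>"
            for sym in seq:
                self.counts[state][sym] += 1
                self.delta[(state, sym)] = sym
                state = sym

    def code_length(self, seq, alphabet_size):
        # Negative log2 probability of the sequence with Laplace (+1)
        # smoothing; unseen transitions fall back to a uniform estimate.
        bits, state = 0.0, "<start>"
        for sym in seq:
            total = sum(self.counts[state].values())
            p = (self.counts[state][sym] + 1) / (total + alphabet_size)
            bits += -math.log2(p)
            state = self.delta.get((state, sym), sym)
        return bits

def classify(message, spam_model, ham_model, alphabet_size):
    # Shorter encoding = better fit; ties are broken towards ham.
    spam_bits = spam_model.code_length(message, alphabet_size)
    ham_bits = ham_model.code_length(message, alphabet_size)
    return "spam" if spam_bits < ham_bits else "ham"

def weighted_accuracy(correct_ham, correct_spam, n_ham, n_spam, lam=9):
    # Weighted accuracy in the form commonly used in the spam-filtering
    # literature, where legitimate mail counts lambda times more than spam;
    # the weight lambda = 9 is an assumed value, not the paper's setting.
    return (lam * correct_ham + correct_spam) / (lam * n_ham + n_spam)
```

In the paper itself the automata are inferred by minimising the MML two-part message length (the cost of stating the model plus the cost of the data encoded given the model); the sketch fixes a trivial structure purely to show how the resulting models would be used for classification and costed evaluation.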
