Paper information - Statistical compression-based models for text classification

Statistical compression-based models for text classification

Text classification is the task of assigning predefined categories to text documents, and it is a common machine learning problem. Statistical text classification, which uses machine learning methods to learn classification rules, is known to be particularly successful at this task. In this research project we revisit the text classification problem with a sound methodology based on a statistical data compression technique, the Minimum Message Length (MML) principle. To model the data sequence we use Probabilistic Finite State Automata (PFSAs). We propose two approaches to text classification using MML-PFSAs. We tested both approaches on the Enron spam dataset and recorded the results of our empirical evaluation in terms of the well-known classification measures: recall, precision, accuracy and error. The results indicate good classification accuracy, comparable with that of state-of-the-art classifiers.
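The abstract does not spell out the two proposed approaches, so the following is only a minimal sketch of the general idea behind compression-based classification: train one probabilistic model per class, then assign a document to the class whose model encodes it in the fewest bits. It assumes a first-order Markov chain over word tokens with add-one smoothing as a simplified stand-in for a PFSA; it does not perform MML inference of the automaton. The names `MarkovCodeLengthModel`, `classify` and `evaluate`, and the toy training messages, are hypothetical and chosen for illustration. The `evaluate` helper computes the four measures named in the abstract.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-document state


class MarkovCodeLengthModel:
    """First-order Markov chain over word tokens with add-one smoothing.

    Assumption: a simplified stand-in for a PFSA, NOT the MML-PFSA
    inference described in the paper.
    """

    def __init__(self, documents, vocabulary):
        # One extra slot is reserved for any token outside the vocabulary.
        self.vocab_size = len(vocabulary) + 1
        self.transitions = defaultdict(Counter)
        for doc in documents:
            prev = START
            for tok in doc.split():
                self.transitions[prev][tok] += 1
                prev = tok

    def code_length(self, document):
        """Return the number of bits needed to encode the document."""
        bits = 0.0
        prev = START
        for tok in document.split():
            counts = self.transitions.get(prev, Counter())
            total = sum(counts.values())
            prob = (counts.get(tok, 0) + 1) / (total + self.vocab_size)
            bits -= math.log2(prob)
            prev = tok
        return bits


def classify(document, models):
    """Pick the class whose model gives the shortest code length."""
    return min(models, key=lambda label: models[label].code_length(document))


def evaluate(true_labels, predicted_labels, positive="spam"):
    """Compute precision, recall, accuracy and error for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
    }


# Toy training data (invented for illustration; not from the Enron dataset).
training = {
    "spam": ["win free money now", "free prize claim now"],
    "ham": ["meeting agenda for monday", "please review the project report"],
}
vocabulary = {tok for docs in training.values() for doc in docs for tok in doc.split()}
models = {label: MarkovCodeLengthModel(docs, vocabulary) for label, docs in training.items()}

if __name__ == "__main__":
    for message in ["claim free money", "project meeting agenda"]:
        print(message, "->", classify(message, models))
```

Choosing the class with the shortest code length is equivalent to choosing the class under which the document is most probable, since a code length of `-log2(p)` bits shrinks as `p` grows. A full MML treatment would also charge for the cost of stating the model itself, which this sketch omits.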

David L. Dowe | Sid Ray | Vidya Saikrishna
