Machine intelligence-based algorithms for spam filtering on document labeling

The internet provides numerous modes of secure data transmission from one end station to another, and email is one of them. Email is popular because it is cost-effective and enables fast communication. At the same time, many unwanted emails, known as spam, are sent in bulk for monetary gain. Although people can readily recognize an email as spam, doing so by hand wastes time, so machine learning methods are used to automate the classification. The limited availability of email spam datasets, the resulting scarcity of training data, and the informal style in which emails are written are the main issues that have caused current algorithms to fall short of expectations during classification. This paper proposes a novel spam mail detection method based on the document labeling concept, which classifies new emails as ham or spam. Naive Bayes, Decision Tree, and Random Forest (RF) are used in the classification process. Three datasets are used to evaluate the proposed method. Experimental results show that RF achieves higher accuracy than the other methods.
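To make the ham/spam classification task concrete, the following is a minimal sketch of one of the classifier families named above, a multinomial Naive Bayes with Laplace smoothing, written in plain Python. It is not the paper's document labeling method: the toy corpus, the whitespace tokenizer, and the names `tokenize`, `train_naive_bayes`, `classify`, and `TOY_CORPUS` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count documents and words per label from (text, label) pairs."""
    doc_counts = Counter()   # label -> number of training documents
    word_counts = {}         # label -> Counter of token frequencies
    vocab = set()            # every token seen during training
    for text, label in labeled_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok not in model["vocab"]:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed log likelihood of the token
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, invented for this sketch only.
TOY_CORPUS = [
    ("win free money now", "spam"),
    ("claim your free prize today", "spam"),
    ("cheap loans win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train_naive_bayes(TOY_CORPUS)
print(classify(model, "free prize money"))
print(classify(model, "monday report meeting"))
```

A real evaluation along the lines described in the abstract would train on full labeled datasets and compare this kind of probabilistic classifier against tree-based models such as Decision Tree and Random Forest on held-out emails.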
