K-MLP Based Classifier for Discernment of Gratuitous Mails using N-Gram Filtration

Electronic spam is a serious concern on the internet, affecting organisations such as Google and Yahoo. Email spam causes several serious problems, including heavy use of memory space, financial loss, degraded computation speed and power, and threats to legitimate account holders. It lets spammers pose as legitimate account holders of an organisation in order to defraud victims of money and other valuable information. It is therefore necessary to control the spread of spam and to develop an effective and efficient defence mechanism. In this research, we propose an efficient method for characterising spam emails that combines supervised and unsupervised approaches to boost classification performance. The study refines a supervised approach, the multilayer perceptron (MLP), with a fast and efficient unsupervised approach, KMeans, and detects spam emails using the best features selected by the N-Gram technique. The proposed system shows high accuracy with a low error rate compared with the existing technique. It also shows a reduction in vague information when MLP is combined with the KMeans algorithm for selecting initial clusters. N-Gram produces the 100 best features from the data. Finally, the results are presented and the output of the proposed technique is compared with that of the existing technique.
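As an illustration of the N-Gram feature selection step, the following is a minimal sketch and not the implementation used in this study. It assumes word-level bigrams and ranks them by document frequency. The abstract states neither the n-gram type nor the ranking criterion, so both are assumptions, as are the helper names `word_ngrams`, `top_k_ngrams` and `vectorize` and the toy emails. The value k=100 matches the number of features reported above; the example uses k=3 only to keep the data small.

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Return the list of word n-grams in a lower-cased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_ngrams(docs, n=2, k=100):
    """Select the k n-grams that occur in the most documents.

    Ranking by document frequency is an assumption made for this sketch;
    ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each n-gram at most once per document.
        doc_freq.update(set(word_ngrams(doc, n)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def vectorize(doc, vocab, n=2):
    """Map one document to a count vector over the selected n-grams."""
    counts = Counter(word_ngrams(doc, n))
    return [counts[gram] for gram in vocab]


# Hypothetical toy corpus, for illustration only.
emails = [
    "win free money now",
    "free money now call today",
    "meeting agenda for monday",
]
vocab = top_k_ngrams(emails, n=2, k=3)
vectors = [vectorize(email, vocab) for email in emails]
```

The resulting fixed-length vectors are the kind of input that a clustering step such as KMeans and a classifier such as an MLP could consume.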
