Effect Of Feature Selection, Synthetic Minority Over-sampling (SMOTE) And Under-sampling On Class Imbalance Classification

Accurate identification of network intrusions is one of the biggest challenges for Network Intrusion Detection (NID) systems. In recent years, machine learning classification techniques have been used to identify network intrusions precisely. However, the multi-class distribution in network intrusion detection data has been found to be highly skewed, and this class imbalance degrades classification accuracy. The work presented in this paper not only explores the role of attribute selection in improving classification accuracy but also investigates the class imbalance problem using the Synthetic Minority Over-sampling Technique (SMOTE) and under-sampling of the major classes. Classification performance is then evaluated over several types of classifiers. The outcome of this work is that, for a class-imbalanced data set, under-sampling is more effective than SMOTE in detecting minor classes. It was also found that the decision tree algorithms (JRIP) and Naive Bayes are more accurate classifiers than the radial basis neural network and the support vector machine. However, no single algorithm suffices for multi-class classification, and this work proposes that a combination of classifiers consisting of Naive Bayes and JRIP could be used to classify minor classes in an imbalanced intrusion detection data set.
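The two resampling strategies compared above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `undersample` and `smote`, the class labels "normal" and "u2r", the per-class cap, the neighbour count `k`, and the toy two-feature data are all assumptions made for illustration. Under-sampling randomly discards rows from the larger classes, while SMOTE creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def undersample(rows, labels, cap, seed=0):
    """Randomly keep at most `cap` rows per class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Classes at or below the cap are kept whole; larger ones are sampled.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


def smote(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic rows from numeric minority rows (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority rows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy data (assumed): 20 majority rows and 4 minority rows with two features.
rows = [(float(i), float(i % 5)) for i in range(20)]
rows += [(100.0, 1.0), (101.0, 2.0), (102.0, 1.5), (103.0, 2.5)]
labels = ["normal"] * 20 + ["u2r"] * 4

# Under-sampling: shrink the major class down to the minority class size.
under_rows, under_labels = undersample(rows, labels, cap=4)
print(Counter(under_labels))

# SMOTE: grow the minority class with 16 synthetic rows instead.
minority = [r for r, l in zip(rows, labels) if l == "u2r"]
synthetic = smote(minority, n_new=16)
print(len(minority) + len(synthetic))
```

In this sketch, both strategies end with balanced classes, but under-sampling discards real majority rows while SMOTE adds interpolated rows that lie between existing minority rows. Either resampled set would then be passed to the classifiers under comparison.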
