Exploring discrepancies in findings obtained with the KDD Cup '99 data set

The KDD Cup '99 data set has been widely used to evaluate intrusion detection prototypes, most based on machine learning techniques, for nearly a decade. The data set served well in the KDD Cup '99 competition to demonstrate that machine learning can be useful in intrusion detection systems. However, there are discrepancies in the findings reported in the literature. Further, some researchers have published criticisms of the data (and the DARPA data from which the KDD Cup '99 data has been derived), questioning the validity of results obtained with this data. Despite the criticisms, researchers continue to use the data due to a lack of better publicly available alternatives. Hence, it is important to identify the value of the data set and the findings from the extensive body of research based on it, which has largely been ignored by the existing critiques. This paper reports on an empirical investigation, demonstrating the impact of several methodological differences in the publicly available subsets, which uncovers several underlying causes of the discrepancy in the results reported in the literature. These findings allow us to better interpret the current body of research, and inform recommendations for future use of the data set.

[1]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[2]  Giovanni Vigna,et al.  NetSTAT: a network-based intrusion detection approach , 1998, Proceedings 14th Annual Computer Security Applications Conference (Cat. No.98EX217).

[3]  N. Japkowicz Learning from Imbalanced Data Sets: A Comparison of Various Strategies * , 2000 .

[4]  Ajith Abraham,et al.  ANTIDS: Self Orga nized Ant-Based C lustering Model for Intrusion Det ection System , 2004, WSTST.

[5]  Abdolreza Mirzaei,et al.  Intrusion detection using fuzzy association rules , 2009, Appl. Soft Comput..

[6]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation , 1998 .

[7]  Andrew H. Sung,et al.  Artificial intelligent techniques for intrusion detection , 2003, SMC'03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483).

[8]  Ajith Abraham,et al.  Ensemble of One-Class Classifiers for Network Intrusion Detection System , 2008, 2008 The Fourth International Conference on Information Assurance and Security.

[9]  Klaus-Robert Müller,et al.  Efficient BackProp , 2012, Neural Networks: Tricks of the Trade.

[10]  Fabio Roli,et al.  Intrusion detection in computer networks by a modular ensemble of one-class classifiers , 2008, Inf. Fusion.

[11]  Leonid Portnoy,et al.  Intrusion detection with unlabeled data using clustering , 2000 .

[12]  Philip K. Chan,et al.  An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection , 2003, RAID.

[13]  Richard A. Kemmerer,et al.  State Transition Analysis: A Rule-Based Intrusion Detection Approach , 1995, IEEE Trans. Software Eng..

[14]  Dieter Gollmann,et al.  Computer Security , 1979, Lecture Notes in Computer Science.

[15]  Jonathan Timmis,et al.  Artificial Immune Systems: A New Computational Intelligence Approach , 2003 .

[16]  Ulf Lindqvist,et al.  Detecting computer and network misuse through the production-based expert system toolset (P-BEST) , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[17]  Salvatore J. Stolfo,et al.  A framework for constructing features and models for intrusion detection systems , 2000, TSEC.

[18]  Sokratis K. Katsikas,et al.  Intrusion Detection Using Evolutionary Neural Networks , 2008, 2008 Panhellenic Conference on Informatics.

[19]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[20]  Dong Xiang,et al.  Information-theoretic measures for anomaly detection , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[21]  Richard Lippmann,et al.  The 1999 DARPA off-line intrusion detection evaluation , 2000, Comput. Networks.

[22]  Michael Kearns,et al.  A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split , 1995, Neural Computation.

[23]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[24]  Eleazar Eskin,et al.  A GEOMETRIC FRAMEWORK FOR UNSUPERVISED ANOMALY DETECTION: DETECTING INTRUSIONS IN UNLABELED DATA , 2002 .

[25]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[26]  Emin Anarim,et al.  An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks , 2005, Expert Syst. Appl..

[27]  Dirk Van den Poel,et al.  Handling class imbalance in customer churn prediction , 2009, Expert Syst. Appl..

[28]  Mohammad Zulkernine,et al.  Efficacy of Hidden Markov Models Over Neural Networks in Anomaly Intrusion Detection , 2006, 30th Annual International Computer Software and Applications Conference (COMPSAC'06).

[29]  Jonathan Timmis,et al.  Artificial immune systems - a new computational intelligence paradigm , 2002 .

[30]  Manas Ranjan Patra,et al.  Ensemble of classifiers for detecting network intrusion , 2009, ICAC3 '09.

[31]  Christopher Krügel,et al.  Intrusion Detection and Correlation - Challenges and Solutions , 2004, Advances in Information Security.

[32]  Christopher Leckie,et al.  Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters , 2005, ACSC.

[33]  Keith Phalp,et al.  Enhancing network based intrusion detection for imbalanced data , 2008, Int. J. Knowl. Based Intell. Eng. Syst..

[34]  Lundy M. Lewis,et al.  A Case-Based Reasoning Approach to the Resolution of Faults in Communication Networks , 1993, Integrated Network Management.

[35]  John McHugh,et al.  Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory , 2000, TSEC.

[36]  Kristopher Kendall,et al.  A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems , 1999 .

[37]  Manas Ranjan Patra,et al.  NETWORK INTRUSION DETECTION USING NAÏVE BAYES , 2007 .

[38]  Daoqiang Zhang,et al.  Hybrid neural network and C4.5 for misuse detection , 2003, Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.03EX693).

[39]  Joshua Alspector,et al.  Data duplication: an imbalance problem ? , 2003 .

[40]  Thomas Stützle,et al.  Ant Colony Optimization , 2009, EMO.

[41]  A.H. Sung,et al.  Identifying important features for intrusion detection using support vector machines and neural networks , 2003, 2003 Symposium on Applications and the Internet, 2003. Proceedings..

[42]  Antoine Geissbühler,et al.  Learning from imbalanced data in surveillance of nosocomial infection , 2006, Artif. Intell. Medicine.

[43]  D. Dasgupta Artificial Immune Systems and Their Applications , 1998, Springer Berlin Heidelberg.

[44]  Salem Benferhat,et al.  On the combination of naive Bayes and decision trees for intrusion detection , 2005, International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06).

[45]  Frédéric Cuppens,et al.  Detecting Known and Novel Network Intrusions , 2006, SEC.

[46]  Gürsel Serpen,et al.  Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context , 2003, MLMTA.

[47]  James S. Albus,et al.  New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)1 , 1975 .

[48]  Hewijin Christine Jiau,et al.  Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem , 2006 .

[49]  Jesus A. Gonzalez,et al.  Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic , 2006, FLAIRS.

[50]  Giovanni Vigna,et al.  Intrusion detection: a brief history and overview , 2002 .

[51]  Cannady,et al.  Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks , 2000 .

[52]  Xiangliang Zhang,et al.  Processing of massive audit data streams for real-time anomaly intrusion detection , 2008, Comput. Commun..

[53]  Ajith Abraham,et al.  Modeling intrusion detection system using hybrid intelligent systems , 2007, J. Netw. Comput. Appl..

[54]  Ian Witten,et al.  Data Mining , 2000 .

[55]  Zied Elouedi,et al.  Naive Bayes vs decision trees in intrusion detection systems , 2004, SAC '04.

[56]  Nitesh V. Chawla,et al.  C4.5 and Imbalanced Data sets: Investigating the eect of sampling method, probabilistic estimate, and decision tree structure , 2003 .

[57]  Peter J. Bentley,et al.  Towards an artificial immune system for network intrusion detection: an investigation of dynamic clonal selection , 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[58]  Jacek M. Zurada,et al.  Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance , 2008, Neural Networks.

[59]  Nicoletta Dessì,et al.  Intelligent Bayesian Classifiers in Network Intrusion Detection , 2005, IEA/AIE.

[60]  Andrew H. Sung,et al.  Identifying key features for intrusion detection using neural networks , 2002 .

[61]  Efstathios Stamatatos,et al.  Author identification: Using text sampling to handle the class imbalance problem , 2008, Inf. Process. Manag..

[62]  Ulf Lindqvist,et al.  eXpert-BSM: a host-based intrusion detection solution for Sun Solaris , 2001, Seventeenth Annual Computer Security Applications Conference.

[63]  Charles Elkan,et al.  Results of the KDD'99 classifier learning , 2000, SKDD.

[64]  R.K. Cunningham,et al.  Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation , 2000, Proceedings DARPA Information Survivability Conference and Exposition. DISCEX'00.

[65]  Yalin Zheng,et al.  Learning from imbalanced data: a comparative study for colon CAD , 2008, SPIE Medical Imaging.

[66]  Aurobindo Sundaram,et al.  An introduction to intrusion detection , 1996, CROS.

[67]  Koral Ilgun,et al.  USTAT: a real-time intrusion detection system for UNIX , 1993, Proceedings 1993 IEEE Computer Society Symposium on Research in Security and Privacy.

[68]  Ratan K. Guha,et al.  Effective intrusion detection using multiple sensors in wireless ad hoc networks , 2003, 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the.

[69]  Marco Dorigo Ant colony optimization , 2004, Scholarpedia.

[70]  Marc Dacier,et al.  Towards a taxonomy of intrusion-detection systems , 1999, Comput. Networks.

[71]  Ali A. Ghorbani,et al.  Comparative Study of Supervised Machine Learning Techniques for Intrusion Detection , 2007, Fifth Annual Conference on Communication Networks and Services Research (CNSR '07).

[72]  Charles V. Wright,et al.  HMM profiles for network traffic classification , 2004, VizSEC/DMSEC '04.

[73]  Gürsel Serpen,et al.  Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set , 2004, Intell. Data Anal..

[74]  Adam Przepiórkowski,et al.  Definition Extraction with Balanced Random Forests , 2008, GoTAL.

[75]  Nikos Fakotakis,et al.  Learning Greek Verb Complements: Addressing the Class Imbalance , 2004, COLING.

[76]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[77]  Ajith Abraham,et al.  Feature deduction and ensemble design of intrusion detection systems , 2005, Comput. Secur..

[78]  Eric W. T. Ngai,et al.  Customer churn prediction using improved balanced random forests , 2009, Expert Syst. Appl..

[79]  Dorothy E. Denning,et al.  An Intrusion-Detection Model , 1987, IEEE Transactions on Software Engineering.

[80]  D. Endler,et al.  Intrusion detection. Applying machine learning to Solaris audit data , 1998, Proceedings 14th Annual Computer Security Applications Conference (Cat. No.98EX217).

[81]  Cheng Xiang,et al.  Design of Multiple-Level Hybrid Classifier for Intrusion Detection System , 2005, 2005 IEEE Workshop on Machine Learning for Signal Processing.