Using CGAN to Deal with Class Imbalance and Small Sample Size in Cybersecurity Problems

Predictive modelling in cybersecurity domains usually involves dealing with complex settings. Class imbalance is a well-known challenge in this domain. For instance, in a real-world intrusion detection scenario, the number of attacks is expected to be a very small percentage of the number of normal cases. Moreover, in these applications the number of labelled examples available is also small because of the complexity and cost of the labelling process: teams of domain experts need to be involved, which makes the process expensive, time-consuming and error-prone. Addressing these problems is critical to the success of predictive modelling in cybersecurity applications. In this paper we tackle class imbalance and small sample size through an up-sampling procedure based on a conditional generative adversarial network (CGAN). We carry out an extensive set of experiments that show the positive impact of this solution on both problems. We build a large data repository, freely available to the research community, containing 114 binary datasets that are based on real-world cybersecurity problems and generated with diverse levels of imbalance and sample size. Our experiments show a clear advantage of the CGAN-based up-sampling method, especially when the sample size is small and there is a large imbalance between the problem classes. In the most critical scenarios, associated with extreme rarity and very small sample size, an impressive performance boost is achieved. We also explore the behaviour of this approach when these problems are less marked and find that, while CGAN-based up-sampling does not further improve minority class performance, it also has no negative impact. Thus, it is also safe to use in these scenarios.
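
To make the idea of datasets with controlled imbalance and sample size concrete, the following is a minimal sketch of how one such variant could be drawn from a larger labelled binary dataset. It is an illustration under stated assumptions, not the procedure used to build the repository: the function name `make_imbalanced_subset`, its parameters, the convention that label 1 marks the minority (attack) class, and the toy data are all hypothetical.

```python
import numpy as np

def make_imbalanced_subset(X, y, sample_size, minority_fraction, seed=0):
    """Draw a subset of (X, y) with a chosen size and minority-class share.

    Assumption: binary labels, with 1 marking the minority (attack) class.
    """
    rng = np.random.default_rng(seed)
    # Keep at least one minority example so the class never vanishes.
    n_minority = max(1, round(sample_size * minority_fraction))
    n_majority = sample_size - n_minority
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    if n_minority > minority_idx.size or n_majority > majority_idx.size:
        raise ValueError("not enough examples to build the requested subset")
    # Sample each class without replacement, then shuffle the combined indices.
    chosen = np.concatenate([
        rng.choice(minority_idx, size=n_minority, replace=False),
        rng.choice(majority_idx, size=n_majority, replace=False),
    ])
    rng.shuffle(chosen)
    return X[chosen], y[chosen]

# Toy data standing in for a real intrusion detection dataset (hypothetical).
rng = np.random.default_rng(42)
X_full = rng.normal(size=(10_000, 5))
y_full = (rng.random(10_000) < 0.3).astype(int)

# A small, extremely imbalanced variant: 200 examples, 1% minority class.
X_sub, y_sub = make_imbalanced_subset(
    X_full, y_full, sample_size=200, minority_fraction=0.01
)
print(X_sub.shape, int(y_sub.sum()))
```

Varying `sample_size` and `minority_fraction` over a grid would yield a family of datasets covering scenarios from mild imbalance to extreme rarity, which is the kind of setting in which an up-sampling method can then be evaluated.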