Iterative Selection of Categorical Variables for Log Data Anomaly Detection

Log data is a well-known source for anomaly detection in cyber security. Accordingly, a large number of approaches based on self-learning algorithms have been proposed in the past. Most of these approaches focus on numeric features extracted from logs, since such variables are convenient to use with common machine learning techniques. However, system log data frequently contains multiple categorical features that provide further insight into the state of a computer system and thus have the potential to improve detection accuracy. Unfortunately, it is non-trivial to derive useful correlation rules from the vast number of possible values of all available categorical variables. Therefore, we propose the Variable Correlation Detector (VCD), which employs a sequence of selection constraints to efficiently disclose pairs of variables with correlating values. The approach also comprises an online mode that continuously updates the identified variable correlations to account for system evolution and applies statistical tests on conditional occurrence probabilities for anomaly detection. Our evaluations show that the VCD can be adjusted to fit the properties of the data at hand and discloses associated variables with high accuracy. Our experiments with real log data indicate that the VCD is capable of detecting attacks such as scans and brute-force intrusions with higher accuracy than existing detectors.
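
To make the idea of testing conditional occurrence probabilities more concrete, the following Python sketch learns how often a value b of one categorical variable co-occurs with a value a of another, and then checks whether a new window of log events deviates significantly from that learned probability. This is only an illustration under stated assumptions, not the VCD implementation: the function names, the one-sided exact binomial test, and the significance level `alpha` are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
from math import comb


def learn_conditionals(pairs):
    """Estimate P(b | a) from observed (a, b) value pairs."""
    counts = defaultdict(Counter)
    for a, b in pairs:
        counts[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }


def binom_lower_tail(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_anomalous(window, model, a, b, alpha=0.01):
    """Flag the window if b follows a significantly less often than learned.

    alpha is an assumed significance level, not a value from the paper.
    """
    n = sum(1 for x, _ in window if x == a)
    k = sum(1 for x, y in window if x == a and y == b)
    p = model.get(a, {}).get(b, 0.0)
    if n == 0 or p == 0.0:
        return False  # nothing learned or nothing observed: no decision
    return binom_lower_tail(k, n, p) < alpha


# Hypothetical training data: user "alice" almost always logs in from "h1".
training = [("alice", "h1")] * 9 + [("alice", "h2")]
model = learn_conditionals(training)

normal_window = [("alice", "h1")] * 5
suspicious_window = [("alice", "h2")] * 5
```

With this toy data, `is_anomalous(normal_window, model, "alice", "h1")` is false, while the same call on `suspicious_window` is true, because observing zero logins from "h1" in five events is very unlikely under a learned probability of 0.9. An online variant would additionally update the counts as new events arrive.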
