Nested Dirichlet models for unsupervised attack pattern detection in honeypot data

Cyber-systems are under near-constant threat from intrusion attempts. Attack types vary, but each attempt typically has a specific underlying intent, and the perpetrators are usually groups of individuals with similar objectives. Clustering attacks that appear to share a common intent is therefore valuable to threat-hunting experts. This article explores Dirichlet distribution topic models for clustering terminal session commands collected from honeypots, which are special network hosts designed to entice malicious attackers. Clustering the sessions has two main practical uses: finding groups of similar attacks, and identifying outliers. A range of statistical models is considered, each adapted to the structure of command-line syntax. In particular, the concepts of primary and secondary topics, and then of session-level and command-level topics, are introduced into the models to improve interpretability. The proposed methods are further extended in a Bayesian nonparametric fashion to allow for an unbounded vocabulary size and an unbounded number of latent intents. The methods are shown to discover an unusual MIRAI variant that attempts to take over existing cryptocurrency coin-mining infrastructure, and that is not detected by traditional topic-modelling approaches.
