Domain-Embeddings Based DGA Detection with Incremental Training Method

DGA-based botnets, which use Domain Generation Algorithms (DGAs) to evade supervision, have become one of the most destructive threats to network security. Over the past decades, a wealth of defense mechanisms focusing on domain features have emerged to address the problem. Nonetheless, DGA detection remains a challenging task, both because of the sheer volume of Internet traffic and because linguistic features extracted only from domain names are insufficient: adversaries can easily forge them to disrupt detection. In this paper, we propose a novel DGA detection system which employs an incremental word-embeddings method to capture the interactions between end hosts and domains, characterize the time-series patterns of DNS queries for each IP address, and thereby explore temporal similarities between domains. We carefully modify the Word2Vec algorithm and leverage it to automatically learn dynamic and discriminative feature representations for over 1.9 million domains, and develop a simple classifier for distinguishing malicious domains from benign ones. Given its ability to identify temporal patterns of domains and update models incrementally, the proposed scheme makes progress toward adapting to the changing and evolving strategies of DGA domains. Our system is evaluated and compared with the state-of-the-art system FANCI and two deep-learning methods, CNN and LSTM, using data from a large university’s network named TUNET. The results suggest that our system outperforms these strong competitors by a large margin on multiple metrics while achieving a remarkable speed-up in model updating.
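As an illustration of the idea of treating each host's DNS queries as a time-ordered "sentence" of domains, the following is a minimal sketch and not the system described above. It groups queries by client IP and enumerates the (center, context) domain pairs that a skip-gram style embedding model would typically train on. The log layout, the function names `build_host_sequences` and `skipgram_pairs`, and the window size are all assumptions made for this example.

```python
from collections import defaultdict


def build_host_sequences(dns_log):
    """Group queried domains by client IP, ordered by timestamp.

    dns_log is an iterable of (timestamp, client_ip, domain) tuples;
    this record layout is an assumption made for the example.
    """
    per_host = defaultdict(list)
    # Sorting the tuples orders them by timestamp first.
    for _timestamp, client_ip, domain in sorted(dns_log):
        per_host[client_ip].append(domain)
    return dict(per_host)


def skipgram_pairs(sequence, window=2):
    """List (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical toy log: three queries from one host, one from another.
dns_log = [
    (3, "10.0.0.1", "c.example"),
    (1, "10.0.0.1", "a.example"),
    (2, "10.0.0.2", "x.example"),
    (2, "10.0.0.1", "b.example"),
]

sequences = build_host_sequences(dns_log)
pairs = skipgram_pairs(sequences["10.0.0.1"], window=1)
```

Domains that the same host queries close together in time end up in each other's context, which is how an embedding trained on such pairs can place temporally related domains near one another.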
