Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings

Domain generation algorithms (DGAs) are frequently employed by malware to generate domains used for connecting to command-and-control (C2) servers. Recent work in DGA detection has leveraged deep learning architectures such as convolutional neural networks (CNNs) and character-level long short-term memory networks (LSTMs) to classify domains. However, these classifiers perform poorly on wordlist-based DGA families, which generate domains by pseudorandomly concatenating dictionary words. We propose a novel approach that combines context-sensitive word embeddings with a simple fully-connected classifier to classify domains based on word-level information. The word embeddings were pre-trained on a large unrelated corpus and left frozen during training on domain data. The resulting small number of trainable parameters enabled extremely short training durations, while the transfer of language knowledge stored in the representations allowed for high-performing models with small training datasets. We show that this architecture reliably outperformed existing techniques on wordlist-based DGA families with just 30 DGA training examples and achieved state-of-the-art performance with around 100 DGA training examples, all while requiring an order of magnitude less training time than current techniques. Of special note is the technique’s performance on the matsnu DGA: the classifier attained an 89.5% detection rate with a 1:1,000 false positive rate (FPR) after training on only 30 examples of the DGA domains, and a 91.2% detection rate with a 1:10,000 FPR after 90 examples. Considering that some of these DGAs have wordlists of several hundred words, our results demonstrate that this technique does not rely on the classifier learning the DGA wordlists. Instead, the classifier is able to learn the semantic signatures of the wordlist-based DGA families.
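
The detection rates above are quoted at fixed false positive rates. The following is a minimal sketch of how such a figure can be computed from classifier scores. The function name, the "flag if score is strictly above the threshold" convention, and the toy scores are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def detection_rate_at_fpr(benign_scores, dga_scores, max_fpr):
    """Return (threshold, fpr, detection_rate) for a target false positive rate.

    A domain is flagged when its score is strictly greater than the threshold.
    The threshold is chosen on the benign scores so that the measured FPR
    does not exceed max_fpr.
    """
    benign = np.sort(np.asarray(benign_scores, dtype=float))
    dga = np.asarray(dga_scores, dtype=float)
    n = benign.size

    # Number of benign domains we may flag without exceeding max_fpr.
    # The small epsilon guards against floating-point error in the product.
    allowed = int(np.floor(max_fpr * n + 1e-9))

    if allowed >= n:
        threshold = -np.inf  # every benign domain may be flagged
    else:
        # With `allowed` benign scores above it, this is the lowest usable threshold.
        threshold = benign[n - allowed - 1]

    fpr = float(np.mean(benign > threshold))
    detection_rate = float(np.mean(dga > threshold))
    return threshold, fpr, detection_rate


# Toy data (hypothetical scores, not results from the paper).
benign = np.arange(1000) / 1000.0            # 0.000, 0.001, ..., 0.999
dga = np.array([0.9995, 0.9999, 0.5, 0.9985])
threshold, fpr, rate = detection_rate_at_fpr(benign, dga, max_fpr=1 / 1000)
print(threshold, fpr, rate)
```

Under this convention, a 1:1,000 FPR means that at most one benign domain in a thousand scores above the threshold, and the detection rate is the fraction of DGA domains that do.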
