Paper information - Using semi-supervised machine learning to address the Big Data problem in DNS networks

The problem of Big Data in cyber security (i.e., too much network data to analyze) grows every day. Our approach is based on a fundamental characteristic of this data: the overwhelming majority of the network traffic in a traditionally secured enterprise (i.e., one using defense-in-depth) is non-malicious. Therefore, one way of eliminating the Big Data problem in cyber security is to ignore that non-malicious majority and focus only on the much smaller amount of suspicious or malicious network traffic. Our approach uses simple clustering together with a dataset enriched with known malicious domains (i.e., anchors) to filter out the non-suspicious network traffic accurately and quickly. Our algorithm has demonstrated the predictive ability to filter out approximately 97% (depending on the algorithm used) of the non-malicious data in millions of Domain Name System (DNS) queries within minutes, and to identify the small percentage of previously unseen suspicious network traffic. We demonstrate that the remaining network traffic can be analyzed with traditional reputation systems, blacklists, or in-house threat tracking sources (we used virustotal.com) to identify harmful domains that are being accessed from within the enterprise network. Specifically, our results show that the method can reduce a dataset of 400k query-answer domains (with complete malicious domain ground truth) to only 3% of its original size while retaining 99% of all malicious domains. Further, we demonstrate that this capability scales to 10 million query-answer pairs, which it can reduce by 97% in less than an hour.
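The anchor idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says "simple clustering", so the use of k-means, the three lexical features, the value of `k`, and every function name below (`lexical_features`, `kmeans`, `filter_by_anchors`) are assumptions made for illustration. The sketch enriches the observed domains with known malicious anchors, clusters everything, and keeps only the observed domains that share a cluster with at least one anchor.

```python
import math
from collections import Counter


def lexical_features(domain):
    """Hypothetical features (not from the paper): scaled length,
    digit ratio, and character entropy. Assumes a non-empty name."""
    name = domain.lower().rstrip(".")
    n = len(name)
    counts = Counter(name)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    digit_ratio = sum(ch.isdigit() for ch in name) / n
    return (n / 10.0, digit_ratio, entropy)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def filter_by_anchors(domains, anchors, k=2, featurize=lexical_features):
    """Split observed domains into (suspicious, filtered_out).

    A domain is kept as suspicious when its cluster contains an anchor.
    """
    observed, anchor_set = set(domains), set(anchors)
    # Enrich the observed data with the anchors, dropping duplicates in order.
    enriched = list(dict.fromkeys(list(domains) + list(anchors)))
    labels = kmeans([featurize(d) for d in enriched], k)
    anchor_clusters = {lab for d, lab in zip(enriched, labels) if d in anchor_set}
    suspicious = [
        d for d, lab in zip(enriched, labels)
        if d in observed and lab in anchor_clusters
    ]
    filtered_out = [
        d for d, lab in zip(enriched, labels)
        if d in observed and lab not in anchor_clusters
    ]
    return suspicious, filtered_out


if __name__ == "__main__":
    # Made-up example names; real input would be DNS query-answer pairs.
    seen = ["google.com", "wikipedia.org", "x7k2q9z1m4p8w3.info"]
    known_bad = ["q8w3z1x7k2m9p4.biz"]
    suspicious, filtered_out = filter_by_anchors(seen, known_bad)
    print("forward to reputation checks:", suspicious)
    print("ignored as non-suspicious:", filtered_out)
```

The suspicious list is what would then be passed to a reputation system or blacklist, as the abstract describes. The `featurize` parameter is included so that a different feature set can be swapped in without touching the clustering or filtering steps.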
