Using semi-supervised machine learning to address the Big Data problem in DNS networks

The problem of Big Data in cyber security (i.e., too much network data to analyze) compounds itself every day. Our approach is based on a fundamental characteristic of Big Data: an overwhelming majority of the network traffic in a traditionally secured enterprise (i.e., using defense-in-depth) is non-malicious. Therefore, one way of eliminating the Big Data problem in cyber security is to ignore the overwhelming majority of an enterprise's non-malicious network traffic and focus only on the smaller amounts of suspicious or malicious network traffic. Our approach uses simple clustering along with a dataset enriched with known malicious domains (i.e., anchors) to accurately and quickly filter out the non-suspicious network traffic. Our algorithm has demonstrated the predictive ability to accurately filter out approximately 97% (depending on the algorithm used) of the non-malicious data in millions of Domain Name Service (DNS) queries in minutes and identify the small percentage of unseen suspicious network traffic. We demonstrate that the resulting network traffic can be analyzed with traditional reputation systems, blacklists, or in-house threat tracking sources (we used virustotal.com) to identify harmful domains that are being accessed from within the enterprise network. Specifically, our results show that the method can reduce a dataset of 400k query-answer domains (with complete malicious domain ground truth) down to only 3% containing 99% of all malicious domains. Further, we demonstrate that this capability scales to 10 million query-answer pairs, which it can reduce by 97% in less than an hour.