Automatic Mining of Cyber Intelligence from the Darkweb

Introduction

The previous chapter described the hacker communities present on both the darknet and the clearnet. With that understanding in place, we can begin to use data-mining and machine-learning techniques to aggregate and analyze the data from these communities, with the goal of providing valuable cyber threat intelligence. This chapter is an extension of the work in [80].

We present a system for cyber threat intelligence gathering, built on top of data from communities similar to those presented in Chapter 3. At the time of writing, this system collects, on average, 305 high-quality cyber threat warnings each week. These warnings contain information about malware and exploits, many of which are newly developed and have not yet been deployed in a cyber-attack, which makes the information particularly useful to cyberdefenders. Significantly augmented through the use of various data-mining and machine-learning techniques, the system is able to recall 92% of products in marketplaces and 80% of discussions on forums relating to malicious hacking, as labeled by a security analyst, with high precision. Additionally, we present a model based on topic modeling that is used to automatically identify new hacker forums and exploit marketplaces for data collection.

In the sections that follow, we introduce a machine-learning-based scraping infrastructure for gathering such intelligence from these online communities. We also discuss the challenges associated with constructing such a system and how we addressed them. Figure 4.1 shows the number of detected threats over five weeks, and Table 4.1 shows the database statistics at the time of writing, which indicate that only a small fraction of the data collected is hacking related. The vendor and user statistics cited consider only those individuals involved in the discussion or sale of malicious hacking-related material, as identified by the system.

Specific contributions of this chapter include:
• A description of a system for cyber threat intelligence gathering from various social platforms on the Internet, such as deepnet and darknet websites.
• The implementation and evaluation of learning models to separate relevant information from noise in the data collected from these online platforms.
• A machine-learning approach, based on topic modeling, to aid security experts in the discovery of new relevant deepnet and darknet websites of interest. This reduces the time and cost associated with identifying new deepnet and darknet sites.
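
The recall and precision figures quoted in this introduction compare what the system flags against what a security analyst labeled as relevant. As a reminder of how those two measures are defined, the following is a minimal Python sketch. The function name `precision_recall` and the item identifiers are hypothetical and chosen only for illustration; the toy data is not drawn from the system described in this chapter.

```python
def precision_recall(predicted, labeled):
    """Return (precision, recall) for two collections of item identifiers.

    predicted: items the classifier flagged as hacking related.
    labeled:   items a human analyst marked as hacking related.
    """
    predicted, labeled = set(predicted), set(labeled)
    true_positives = len(predicted & labeled)
    # Precision: share of flagged items that the analyst also marked relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of analyst-marked items that the classifier found.
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical toy data (illustrative only, not results from the system).
analyst_labeled = {"p1", "p2", "p3", "p4", "p5"}
system_flagged = {"p1", "p2", "p3", "p4", "p9"}

precision, recall = precision_recall(system_flagged, analyst_labeled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case four of the five flagged items are correct and four of the five analyst-labeled items are found, so both measures equal 0.8. A statement such as "recalls 92% of products with high precision" means that the second ratio is 0.92 while the first remains close to 1.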