Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence

To adapt to the rapidly evolving landscape of cyber threats, security professionals actively exchange Indicators of Compromise (IOCs) (e.g., malware signatures, botnet IPs) through public sources (e.g., blogs, forums, tweets). Such information, often presented in articles, posts, and white papers, can be converted into the machine-readable OpenIOC format for automatic analysis and quick deployment to security mechanisms such as intrusion detection systems. With hundreds of thousands of sources in the wild, IOC data are now produced at a volume and velocity that are increasingly hard for humans to manage. Efforts to automatically gather such information from unstructured text, however, are impeded by the limitations of today's Natural Language Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected of IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovative solution for fully automated IOC extraction. Our approach is based upon the observation that the IOCs in technical articles are often described in a predictable way: they are connected to a set of context terms (e.g., "download") through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., "malware", "download") within the sentences of a technical article, and to further analyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly presented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source).
Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates remarkable performance: it generated 900K OpenIOC items with a precision of 95% and a coverage of over 90%, far beyond what state-of-the-art NLP techniques and industry IOC tools can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as on the impact of such open-source threat intelligence on security protection and on the evolution of attack strategies.
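
The core idea above, locating a putative IOC token and its context terms in a sentence and then checking how they are grammatically connected, can be illustrated with a minimal Python sketch. It is not the iACE implementation: the regular expressions, the context-term list, and the dependency edges are hypothetical values written by hand for one example sentence, and a plain shortest-path distance over the dependency graph stands in for the graph mining analysis the abstract refers to.

```python
import re
from collections import deque

# Hypothetical patterns and context terms, chosen for illustration only.
IOC_PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "file": re.compile(r"\b[\w-]+\.(?:zip|exe|dll)\b"),
}
CONTEXT_TERMS = {"download", "downloads", "malware"}


def find_candidates(sentence):
    """Return the (type, token) IOC candidates and the context terms in a sentence."""
    iocs = [(kind, match.group(0))
            for kind, pattern in IOC_PATTERNS.items()
            for match in pattern.finditer(sentence)]
    words = re.findall(r"\w+", sentence.lower())
    context = [word for word in words if word in CONTEXT_TERMS]
    return iocs, context


def path_length(edges, source, target):
    """Breadth-first search over an undirected dependency graph.

    Returns the number of edges between source and target, or None if
    the two tokens are not connected.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


SENTENCE = "The malware downloads payload.zip from 10.0.0.5."

# Hand-written (head, dependent) edges for SENTENCE; a real pipeline
# would obtain these from a dependency parser.
EDGES = [
    ("downloads", "malware"),
    ("downloads", "payload.zip"),
    ("downloads", "10.0.0.5"),
    ("malware", "The"),
]

if __name__ == "__main__":
    iocs, context = find_candidates(SENTENCE)
    for kind, token in iocs:
        for term in context:
            print(kind, token, term, path_length(EDGES, token, term))
```

A short path between a candidate token and a context term is only a crude proxy for the "stable grammatical relations" the abstract describes; it is used here because it keeps the sketch self-contained.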
