Predicting malware threat intelligence using KGs

Large amounts of threat intelligence information about malware attacks are available in disparate, typically unstructured, formats. Knowledge graphs can capture this information and its context as RDF triples composed of entities and relations. Sparse or inaccurate threat information, however, leads to challenges such as incomplete or erroneous triples. Generic information extraction (IE) models used to populate the knowledge graph cannot fully guarantee domain-specific context. This paper proposes a system to generate a malware knowledge graph called MalKG, the first open-source automated knowledge graph for malware threat intelligence. The MalKG dataset (MT40K) contains approximately 40,000 triples generated from 27,354 unique entities and 34 relations. For ground truth, we manually curate a knowledge graph called MT3K, with 3,027 triples generated from 5,741 unique entities and 22 relations. We demonstrate the intelligence prediction of MalKG using two use cases. Predicting malware threat information using a benchmark model achieves 80.4 for the hits@10 metric (whether the correct answer appears among the top 10 predicted options for an information class) and 0.75 for the MRR (mean reciprocal rank). We also propose a contextual framework for information extraction, applied both manually and automatically at the sentence level to 1,100 malware threat reports and to the Common Vulnerabilities and Exposures (CVE) database.
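
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes each test triple has already been assigned the rank of its correct entity among the model's predictions; the function names and the example ranks are hypothetical and are not taken from the paper or its results.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks start at 1)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct entity for four test triples
# (illustrative values only, not data from MalKG).
example_ranks = [1, 2, 4, 20]

hits10 = hits_at_k(example_ranks, k=10)      # 3 of 4 ranks are <= 10
mrr = mean_reciprocal_rank(example_ranks)    # (1 + 1/2 + 1/4 + 1/20) / 4
```

Hits@10 rewards any correct answer that lands in the top 10, while MRR also reflects how close to the top it lands, which is why the two are usually reported together.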
