DeepWeak: Reasoning common software weaknesses via knowledge graph embedding

Common software weaknesses, such as improper input validation and integer overflow, can harm system security directly or indirectly, causing adverse effects such as denial of service and execution of unauthorized code. Common Weakness Enumeration (CWE) maintains a standard list and classification of common software weaknesses. Although CWE contains rich information about software weaknesses, including textual descriptions, common consequences, and relations between software weaknesses, the current data representation, i.e., hyperlinked documents, does not support advanced reasoning tasks on software weaknesses, such as prediction of missing relations and common consequences of CWEs. Such reasoning tasks become critical to managing and analyzing large numbers of common software weaknesses and their relations. In this paper, we propose to represent common software weaknesses and their relations as a knowledge graph, and we develop a translation-based, description-embodied knowledge representation learning method that embeds both the software weaknesses and their relations in the knowledge graph into a semantic vector space. The resulting vector representations (i.e., embeddings) of software weaknesses and their relations can be exploited for knowledge acquisition and inference. We conduct extensive experiments to evaluate the performance of the software weakness and relation embeddings in three reasoning tasks: CWE link prediction, CWE triple classification, and common consequence prediction. Our knowledge graph embedding approach outperforms other description- and/or structure-based representation learning methods.
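To make the translation-based idea concrete, the following is a minimal sketch of how a TransE-style model scores a triple (head, relation, tail): the relation is treated as a translation vector, so a plausible triple should satisfy h + r ≈ t. This is only an illustration under stated assumptions. The function names, the placeholder weakness names, and the hand-picked 2-d vectors are hypothetical, and the sketch omits the description-embodied part of the method and all training.

```python
import math

def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, candidates):
    """Sort candidate tail names from most to least plausible."""
    return sorted(candidates, key=lambda name: transe_score(head, relation, candidates[name]))

# Hypothetical, hand-picked 2-d embeddings (not learned values, not real CWE entries).
entities = {
    "weakness_a": [0.0, 0.0],
    "weakness_b": [1.0, 0.0],
    "weakness_c": [0.0, 1.0],
}
relations = {"ParentOf": [0.0, 1.0]}

# Link prediction: which entity best completes (weakness_a, ParentOf, ?)
candidates = {name: vec for name, vec in entities.items() if name != "weakness_a"}
ranking = rank_tails(entities["weakness_a"], relations["ParentOf"], candidates)
print(ranking)
```

In this toy setting, link prediction amounts to ranking candidate tails by score, and triple classification amounts to comparing a triple's score against a threshold chosen on validation data.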
