GapFinder: Finding Inconsistency of Security Information From Unstructured Text

Text mining of open-source intelligence on the Web has become increasingly important across a wide range of domains, including business, law enforcement, the military, and cybersecurity. Text mining efforts use natural language processing to transform unstructured web content into structured forms that can drive machine learning applications and data indexing services. In cybersecurity, for example, text mining has produced a range of threat intelligence services that serve the IT industry. A less studied problem, however, is automatically identifying semantic inconsistencies among text input sources. In this paper, we introduce GapFinder, a new system for identifying semantic inconsistencies within the cybersecurity domain. Specifically, we examine the problem of identifying technical inconsistencies that arise in the functional descriptions found in open-source malware threat reports. Our evaluation, using tens of thousands of relations derived from web-based malware threat reports, demonstrates that GapFinder can identify such inconsistencies.
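To make the notion of an inconsistency among extracted relations concrete, the following is a minimal sketch and not GapFinder's method, which the abstract does not detail. It assumes relations have already been extracted as (source, subject, predicate, object) tuples and flags cases where reports give different values for a predicate assumed to be single-valued. The `Relation` type, the `SINGLE_VALUED` set, the predicate names, and the sample data (including the malware name "ExampleBot") are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    source: str     # hypothetical ID of the report the relation came from
    subject: str    # malware name
    predicate: str  # relation type
    obj: str        # value asserted by the report


# Assumption: predicates expected to have exactly one value per malware.
SINGLE_VALUED = {"file_type", "first_seen"}


def find_conflicts(relations):
    """Map (subject, predicate) to {value: [sources]} where sources disagree."""
    values = defaultdict(lambda: defaultdict(list))
    for r in relations:
        if r.predicate in SINGLE_VALUED:
            key = (r.subject.lower(), r.predicate)
            values[key][r.obj.lower()].append(r.source)
    # Keep only keys for which more than one distinct value was reported.
    return {k: dict(v) for k, v in values.items() if len(v) > 1}


# Made-up sample relations for illustration only.
reports = [
    Relation("report_a", "ExampleBot", "file_type", "PE"),
    Relation("report_b", "examplebot", "file_type", "ELF"),
    Relation("report_a", "ExampleBot", "first_seen", "2016"),
    Relation("report_c", "ExampleBot", "first_seen", "2016"),
    Relation("report_b", "ExampleBot", "uses_protocol", "HTTP"),
    Relation("report_c", "ExampleBot", "uses_protocol", "IRC"),
]

conflicts = find_conflicts(reports)
for (subject, predicate), by_value in conflicts.items():
    print(subject, predicate, by_value)
```

Exact string matching of this kind would miss paraphrases and would treat legitimately multi-valued predicates (here, `uses_protocol`) as out of scope, which is why the problem calls for semantic rather than purely lexical comparison.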
