Semi-Automated Information Extraction from Unstructured Threat Advisories

One of the fundamental challenges facing information officers in most organizations today is the growing number of cyber security threats. This has given rise to the emerging field of Cyber Threat Intelligence, a mechanism for acquiring, categorizing and prioritizing information about impending security threats from disparate online sources, so that organizations can take the steps needed to avoid compromising client data and to protect their hardware and software resources. Such information is published as formal security advisories, which are largely unstructured or semi-structured data. In this work we describe an approach that reads large volumes of such unstructured data and automatically extracts useful nuggets of information, such as exploit targets, exploitation techniques and recommended prevention guidelines. We use natural language processing techniques and a pattern identification framework to extract these information nuggets. We present some early results and observations.
