Text mining for security threat detection: discovering hidden information in unstructured log messages

The exponential growth of unstructured messages generated by computer systems and applications in modern computing environments poses a significant challenge in managing and using the information those messages contain. Although these data hold a wealth of information useful for advanced threat detection, their sheer volume, variety, and complexity make them difficult to analyze, even for well-trained security analysts. While conventional Security Information and Event Management (SIEM) systems provide some capability to collect, correlate, and detect certain events from structured messages, their rule-based correlation and detection algorithms fall short in using the information within unstructured messages. Our study explores the use of data mining, text classification, natural language processing, and machine learning techniques to detect security threats by extracting relevant information from various unstructured log messages collected from distributed, non-homogeneous systems. The extracted features are used to run a number of experiments on the Packet Clearing House SKAION 2006 IARPA Dataset, and their prediction capability is evaluated. Compared with the base case without feature extraction, an average performance gain of 16.73% and a time reduction of 84% were achieved using extracted features only, and a 23.48% performance gain was attained using both unstructured free-text messages and extracted features. The results also show strong potential for further performance gains from increasing the size of the training datasets and extracting more features from the unstructured log messages.
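
As a rough illustration of what extracting features from an unstructured log message can look like, the sketch below pulls a few fields out of a free-text line using only the Python standard library. The abstract does not say which features the study extracts, so the patterns, the keyword list, the `extract_features` helper, and the sample line are all assumptions made for illustration, not the authors' method.

```python
import re
from collections import Counter

# Hypothetical patterns and keywords; the abstract does not list the
# features actually used in the study.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER = re.compile(r"\buser[ =:]+(\w+)", re.IGNORECASE)
SUSPICIOUS = ("failed", "denied", "invalid", "refused")


def extract_features(message: str) -> dict:
    """Turn one free-text log line into a small dictionary of features."""
    tokens = re.findall(r"[a-z]+", message.lower())  # alphabetic tokens only
    counts = Counter(tokens)
    user = USER.search(message)
    return {
        "ips": IPV4.findall(message),          # dotted-quad addresses
        "user": user.group(1) if user else None,
        "n_tokens": len(tokens),
        "suspicious_hits": sum(counts[w] for w in SUSPICIOUS),
    }


# Made-up sample line in the style of an SSH authentication log.
sample = "Failed password for invalid user admin from 192.0.2.10 port 22 ssh2"
features = extract_features(sample)
print(features)
```

Dictionaries like this one could then be vectorized and passed to a classifier, either on their own or alongside the raw message text, which mirrors the two configurations the abstract compares.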
