Instant message classification in Finnish cyber security themed free-form discussion

Instant messaging enables rapid collaboration between professionals during cyber security incidents. However, monitoring the discussion manually becomes difficult as the number of communication channels increases, and failure to identify relevant information in free-form instant messages may reduce situational awareness. In this paper, we approach the problem by developing a framework that classifies the topics of instant messages in Finnish cyber security-themed discussion. The program uses open source software components for morphological analysis and then converts the messages into Bag-of-Words representations before classifying them into predetermined incident categories. We compared support vector machine (SVM), multinomial naïve Bayes, and complement naïve Bayes (CNB) classifiers using five-fold cross-validation. A combination of SVM and CNB achieved a classification accuracy of over 85 %, while multiclass SVM achieved 87 % accuracy. The implemented program recognizes cyber security-related messages in IRC chat rooms and categorizes them accordingly.
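
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, substitutes a linear SVM (`LinearSVC`) for whichever SVM variant the paper used, and skips the morphological analysis step entirely. The toy messages, the category names (`malware`, `phishing`, `dos`), and the helper `compare_classifiers` are hypothetical and chosen only so that the example runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: five short Finnish messages per hypothetical category.
# Real input would first be lemmatized by a morphological analyzer, because
# inflected forms of the same Finnish word otherwise count as separate tokens.
MESSAGES = [
    "haittaohjelma löytyi palvelimelta",
    "virus tartutti työaseman",
    "troijalainen havaittu verkossa",
    "haittaohjelma levisi sähköpostin liitteestä",
    "virus poistettu työasemalta",
    "kalasteluviesti saapui sähköpostiin",
    "käyttäjä avasi kalastelulinkin",
    "väärennetty kirjautumissivu havaittu",
    "kalasteluviesti pyysi salasanaa",
    "väärennetty sähköposti pyysi tunnuksia",
    "palvelunestohyökkäys kaatoi verkkosivun",
    "liikennepiikki ruuhkautti palvelimen",
    "palvelunestohyökkäys jatkuu edelleen",
    "verkkosivu ei vastaa liikennepiikin takia",
    "hyökkäysliikenne ruuhkautti verkon",
]
LABELS = ["malware"] * 5 + ["phishing"] * 5 + ["dos"] * 5


def compare_classifiers(messages, labels, folds=5):
    """Return mean cross-validated accuracy for each candidate classifier."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    candidates = {
        "SVM": LinearSVC(),        # assumption: linear kernel, one-vs-rest
        "MNB": MultinomialNB(),
        "CNB": ComplementNB(),
    }
    scores = {}
    for name, classifier in candidates.items():
        # CountVectorizer builds the Bag-of-Words representation; it is refit
        # inside every fold so that no test vocabulary leaks into training.
        pipeline = make_pipeline(CountVectorizer(), classifier)
        fold_scores = cross_val_score(
            pipeline, messages, labels, cv=cv, scoring="accuracy"
        )
        scores[name] = fold_scores.mean()
    return scores


if __name__ == "__main__":
    for name, accuracy in compare_classifiers(MESSAGES, LABELS).items():
        print(f"{name}: {accuracy:.2f}")
```

Accuracies on such a tiny toy set say nothing about the figures reported above; the sketch only shows how the three classifiers can share one Bag-of-Words pipeline and one cross-validation split.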