Evaluating randomness in cyber attack textual artifacts

Textual indicators can provide valuable insight for identifying potentially malicious activity. Cyber attacks leave textual clues in many scenarios; examples include domain names, keys and passwords, and text strings encoded in program files. Several techniques can be used to evaluate whether these clues provide useful information for detecting attacks. In this paper, we aim to determine whether a piece of text was generated by a human or generated randomly by a computer algorithm. We specifically consider filenames as the textual artifact. Because dropping, copying, or creating files with randomly generated filenames is a common malware behavior, detecting randomly generated filenames would help identify a cyber attack. To this end, we discuss several features designed to differentiate randomly generated filenames from human-generated ones, and then build a classification model based on these features. On test data of 1 million human-generated filenames and 1 million randomly generated filenames, our model achieves an accuracy of 98.2940% in classifying human-generated filenames and 97.8378% in classifying randomly generated filenames.
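The abstract does not list the features the model uses, so the following is only a minimal sketch of the kind of per-filename features such a classifier might consume. Character-level Shannon entropy, vowel ratio, and digit ratio are assumptions chosen for illustration, and the function names `shannon_entropy` and `filename_features` are hypothetical rather than taken from the paper.

```python
import math
import os
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filename_features(filename: str) -> dict:
    """Illustrative features for one filename (extension is ignored).

    These are assumed features, not the feature set from the paper.
    """
    stem, _ext = os.path.splitext(filename)
    stem = stem.lower()
    letters = [ch for ch in stem if ch.isalpha()]
    digits = [ch for ch in stem if ch.isdigit()]
    vowels = [ch for ch in letters if ch in VOWELS]
    return {
        "length": len(stem),
        "entropy": shannon_entropy(stem),
        # Human-chosen names tend to contain pronounceable, vowel-bearing words.
        "vowel_ratio": len(vowels) / len(letters) if letters else 0.0,
        # Random generators often mix digits uniformly into the name.
        "digit_ratio": len(digits) / len(stem) if stem else 0.0,
    }


if __name__ == "__main__":
    for name in ("report.txt", "x7q9zk2v.tmp"):
        print(name, filename_features(name))
```

Feature dictionaries like these could be stacked into a matrix and passed to any standard classifier. No single feature separates the two classes reliably on short strings, which is presumably why the paper combines several features in one model.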
