Computer activity learning from system call time series

Using a previously introduced similarity function for the stream of system calls generated by a computer, we engineer a program-in-execution classifier using deep learning methods. Tested on malware classification, it significantly outperforms the current state of the art. We provide a series of performance measures and tests to demonstrate its capabilities, including measurements from production use, and show that the system scales linearly with the number of endpoints. Using the system, we estimate the total number of malware families created over the last 10 years at 3450, in line with reasonable economic constraints. Because new malware families appear at a lower rate than previously acknowledged, machine learning malware classifiers risk being tested on their training set; we achieve F1 = 0.995 in a test carefully designed to mitigate this risk.
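
To make the train/test leakage concern concrete, the following is a minimal sketch of one common mitigation: holding out whole malware families so that no family contributes samples to both sets, then scoring with F1 = 2PR / (P + R). The abstract does not describe how the authors designed their test, so the family-disjoint split, the function names, and the toy data here are all illustrative assumptions rather than the paper's method.

```python
# Hypothetical sketch: family-disjoint evaluation and F1 scoring.
# Nothing here is taken from the paper; names and data are illustrative only.

def family_disjoint_split(samples, test_families):
    """Split (sample_id, family, is_malware) tuples so that every sample of a
    held-out family lands in the test set and none of them in the training set."""
    train = [s for s in samples if s[1] not in test_families]
    test = [s for s in samples if s[1] in test_families]
    return train, test


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2PR / (P + R) for the positive class; returns 0.0 if there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: two made-up malware families plus benign programs.
samples = [
    ("s1", "family_a", 1),
    ("s2", "family_a", 1),
    ("s3", "family_b", 1),
    ("s4", "family_b", 1),
    ("s5", "benign_x", 0),
    ("s6", "benign_y", 0),
]
train, test = family_disjoint_split(samples, test_families={"family_b", "benign_y"})

# No family appears on both sides of the split.
shared = {s[1] for s in train} & {s[1] for s in test}
```

With a split like this, a classifier that has only memorized variants of known families gains no advantage on the test set, which is the risk the abstract points to when the number of distinct families is small.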
