Automatic classification of object code using machine learning

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that for samples of un-labeled compiled computer object code, one can apply the same type of analysis to classify important aspects of the code, such as its target architecture and endianess. We show that using simple byte-value histograms we retain enough information about the opcodes within a sample to classify the target architecture with high accuracy, and then discuss heuristic-based features that exploit information within the operands to determine endianess. We introduce a dataset with over 16000 code samples from 20 architectures and experimentally show that by using our features, classifiers can achieve very high accuracy with relatively small sample sizes.

[1]  Ponnuthurai N. Suganthan,et al.  A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification , 2010, SAISMC.

[2]  Minghe Sun,et al.  Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification , 2013, IEEE Transactions on Information Forensics and Security.

[3]  Colin Morris,et al.  Using NLP techniques for file fragment classification , 2012, Digit. Investig..

[4]  Rossilawati Sulaiman,et al.  BYTE FREQUENCY ANALYSIS DESCRIPTOR WITH SPATIAL INFORMATION FOR FILE FRAGMENT CLASSIFICATION , 2013 .

[5]  Karl A Sickendick File Carving and Malware Identification Algorithms Applied to Firmware Reverse Engineering , 2013 .

[6]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[7]  Karthikeyan Sankaralingam,et al.  A Detailed Analysis of Contemporary ARM and x86 Architectures , 2013 .

[8]  Mohammad Hossain Heydari,et al.  Content based file type detection algorithms , 2003, 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the.

[9]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[10]  M. Masrom,et al.  Opcodes histogram for classifying metamorphic portable executables malware , 2012, 2012 International Conference on E-Learning and E-Technologies in Education (ICEEE).

[11]  Ke Wang,et al.  Fileprints: identifying file types by n-gram analysis , 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.