Obfuscation resilient search through executable classification

Android applications are usually obfuscated before release, which makes it difficult to analyze them for malware or intellectual property violations. Obfuscators can hide the true intent of code by renaming variables, modifying program structure, or both. As a result, developers struggle to search efficiently for executables relevant to an obfuscated application that they want to analyze. Prior approaches to obfuscation-resilient search rely on certain structural parts of an app remaining as landmarks, untouched by obfuscation. For instance, some assume that obfuscators do not break the structural relationships between identifiers; others assume that control flow graphs keep their structure. A motivated obfuscator can easily defeat both kinds of approach. We present a new approach, MACNETO, that searches for programs relevant to obfuscated executables by leveraging deep learning and principal components on instructions. MACNETO makes few assumptions about the kinds of modifications that an obfuscator might perform. We show that it achieves high search precision for executables obfuscated by a state-of-the-art obfuscator that changes control flow. We also demonstrate MACNETO's potential to help developers understand executables: for an obfuscated executable, it infers keywords drawn from relevant unobfuscated programs.
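
The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea of searching over principal components of instruction distributions. It is not MACNETO's actual pipeline and omits the deep-learning part entirely. The instruction categories in `VOCAB`, the toy `CORPUS` and `QUERY`, and all function names are hypothetical, chosen purely for illustration.

```python
import math
from collections import Counter

# Hypothetical coarse instruction categories (an assumption, not from the paper).
VOCAB = ["load", "store", "invoke", "branch", "arith", "return"]

def histogram(instructions):
    """Normalized frequency of each instruction category in one executable."""
    counts = Counter(instructions)
    total = sum(counts[op] for op in VOCAB) or 1
    return [counts[op] / total for op in VOCAB]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_components(rows, k=2, iters=200):
    """Return (mean, top-k principal directions) via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    comps = []
    for _ in range(k):
        # Non-uniform start: a uniform vector is orthogonal to centered histograms.
        v = [float(j + 1) for j in range(d)]
        norm = 0.0
        for _step in range(iters):
            w = [dot(row, v) for row in cov]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        if norm < 1e-12:
            break  # no variance left to explain
        eigval = dot(v, [dot(row, v) for row in cov])
        comps.append(v)
        # Deflate: remove the direction just found from the covariance matrix.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, comps

def project(vec, mean, comps):
    """Coordinates of a histogram along the principal directions."""
    centered = [x - m for x, m in zip(vec, mean)]
    return [dot(centered, c) for c in comps]

def search(query_instructions, corpus, k=2):
    """Rank corpus programs by distance to the query in principal-component space."""
    names = list(corpus)
    rows = [histogram(corpus[name]) for name in names]
    mean, comps = principal_components(rows, k)
    q = project(histogram(query_instructions), mean, comps)
    scored = [(math.dist(q, project(r, mean, comps)), name)
              for name, r in zip(names, rows)]
    return [name for _, name in sorted(scored)]

# Toy unobfuscated programs, each reduced to a list of instruction categories.
CORPUS = {
    "crypto": ["load", "arith", "arith", "arith", "store", "arith", "return"],
    "network": ["load", "invoke", "invoke", "invoke", "branch", "invoke", "return"],
    "storage": ["load", "store", "load", "store", "load", "store", "return"],
}
# "Obfuscated" variant of the crypto program with two injected junk branches.
QUERY = CORPUS["crypto"] + ["branch", "branch"]

print(search(QUERY, CORPUS))
```

The sketch ranks candidates by Euclidean distance in the projected space. Because it depends only on instruction frequencies, renaming identifiers has no effect on the ranking and injected branches shift it only slightly, which is the kind of robustness the abstract claims at a much larger scale.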
