Statistical Deobfuscation of Android Applications

This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed "Big Code"). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions. We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.

[1]  Douglas Low,et al.  Protecting Java code via code obfuscation , 1998, CROS.

[2]  Paolo Tonella,et al.  Restructuring program identifier names , 2000, Proceedings 2000 International Conference on Software Maintenance.

[3]  Andreas Krause,et al.  Learning programs from noisy data , 2016, POPL.

[4]  Yijun Yu,et al.  Exploring the Influence of Identifier Names on Code Quality: An Empirical Study , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[5]  Jacques Klein,et al.  Combining static analysis with probabilistic models to enable market-scale Android inter-component analysis , 2016, POPL.

[6]  Daniel Tarlow,et al.  Structured Generative Models of Natural Source Code , 2014, ICML.

[7]  Nebojsa Jojic,et al.  Program verification as probabilistic inference , 2007, POPL '07.

[8]  J. Andrew Bagnell,et al.  (Approximate) Subgradient Methods for Structured Prediction , 2007, International Conference on Artificial Intelligence and Statistics.

[9]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[10]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[11]  Benjamin Livshits,et al.  Merlin: specification inference for explicit information flow problems , 2009, PLDI '09.

[12]  Laurie Hendren,et al.  Soot: a Java bytecode optimization framework , 2010, CASCON.

[13]  Marco Pistoia,et al.  ALETHEIA: Improving the Usability of Static Security Analysis , 2014, CCS.

[14]  Yajin Zhou,et al.  Dissecting Android Malware: Characterization and Evolution , 2012, 2012 IEEE Symposium on Security and Privacy.

[15]  Viktor Kuncak,et al.  Synthesizing Java expressions from free-form queries , 2015, OOPSLA.

[16]  Dawson R. Engler,et al.  A Factor Graph Model for Software Bug Finding , 2007, IJCAI.

[17]  Nir Friedman,et al.  Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning , 2009 .

[18]  Ran El-Yaniv,et al.  Estimating types in binaries using predictive modeling , 2016, POPL.

[19]  Robert D. Macredie,et al.  The effects of comments and identifier names on program comprehensibility: an experimental investigation , 1996, J. Program. Lang..

[20]  Andreas Krause,et al.  Predicting Program Properties from "Big Code" , 2015, POPL.

[21]  Dawson R. Engler,et al.  From uncertainty to belief: inferring the specification within , 2006, OSDI '06.

[22]  Charles A. Sutton,et al.  Suggesting accurate method and class names , 2015, ESEC/SIGSOFT FSE.

[23]  Martin T. Vechev,et al.  Phrase-Based Statistical Translation of Programming Languages , 2014, Onward!.

[24]  Bin Ma,et al.  Following Devil's Footprints: Cross-Platform Analysis of Potentially Harmful Libraries on Android and iOS , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[25]  Dawn Xiaodong Song,et al.  Recognizing Functions in Binaries with Neural Networks , 2015, USENIX Security Symposium.

[26]  Jacques Klein,et al.  FlowDroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps , 2014, PLDI.

[27]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.

[28]  Charles A. Sutton,et al.  Learning natural coding conventions , 2014, SIGSOFT FSE.

[29]  Andrew D. Gordon,et al.  Bimodal Modelling of Source Code and Natural Language , 2015, ICML.

[30]  Byung-Gon Chun,et al.  TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones , 2010, OSDI.

[31]  Phillip A. Relf,et al.  Tool assisted identifier naming for improved software readability: an empirical study , 2005, 2005 International Symposium on Empirical Software Engineering, 2005..

[32]  Haoyu Wang,et al.  LibRadar: Fast and Accurate Detection of Third-Party Libraries in Android Apps , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C).