A scalable learning algorithm for data-driven program analysis

Abstract

Context: Recently, data-driven program analysis has emerged as a promising approach for building cost-effective static analyzers. An ideal static analyzer applies accurate but costly techniques only where they improve precision. However, designing such a strategy for real-world programs is highly nontrivial and requires labor-intensive work. The goal of data-driven program analysis is to automate this process by learning the strategy from data.

Objective: Current learning algorithms for data-driven program analysis are not scalable enough to be used with large codebases. The objective of this paper is to overcome this shortcoming and present a new algorithm that can efficiently learn a strategy from large codebases.

Method: The key idea is to use an oracle to transform the existing blackbox learning problem into a whitebox one that is much easier to solve. The oracle quantifies the relative importance of each part of the program with respect to analysis precision. It can be obtained by running the most precise and the least precise analyses only once each over the codebase.

Results: Our learning algorithm is much faster than the existing algorithms while producing high-quality strategies. The evaluation uses 140 open-source C programs, comprising 2.1 MLoC in total. Learning at this scale was previously impractical.

Conclusion: Our work advances the state of the art in data-driven program analysis by addressing the scalability issue of the existing learning algorithm. Our technique will make the data-driven approach more practical in the real world.
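The oracle idea summarized under Method can be illustrated with a toy sketch. This is not the paper's algorithm. It assumes, purely for illustration, that each program part (here a function name) is scored by how many alarms the least precise run reports that the most precise run eliminates, and that the "whitebox" step is an ordinary least-squares fit from hand-made feature vectors to those scores. The names `oracle_scores` and `fit_linear`, the alarm sets, and the features are all hypothetical.

```python
from typing import Dict, List, Set


def oracle_scores(coarse_alarms: Dict[str, Set[str]],
                  precise_alarms: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each program part by its share of eliminated alarms.

    Assumption of this sketch: an alarm counts as eliminated when the least
    precise run reports it but the most precise run does not.
    """
    removed = {part: len(alarms - precise_alarms.get(part, set()))
               for part, alarms in coarse_alarms.items()}
    total = sum(removed.values())
    if total == 0:
        return {part: 0.0 for part in removed}
    return {part: count / total for part, count in removed.items()}


def fit_linear(features: Dict[str, List[float]], scores: Dict[str, float],
               steps: int = 2000, rate: float = 0.1) -> List[float]:
    """Fit weights w so that dot(w, features[part]) approximates scores[part].

    Plain gradient descent on mean squared error: a standard supervised
    regression, standing in for the whitebox learning step.
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    n = len(features)
    for _ in range(steps):
        grad = [0.0] * dim
        for part, x in features.items():
            err = sum(wi * xi for wi, xi in zip(w, x)) - scores[part]
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / n
        w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical results of the two analysis runs, keyed by function name.
coarse = {"f": {"a1", "a2", "a3"}, "g": {"a4"}, "h": {"a5"}}
precise = {"f": {"a1"}, "g": {"a4"}, "h": set()}
scores = oracle_scores(coarse, precise)

# Hypothetical two-dimensional feature vectors for the same functions.
features = {"f": [1.0, 1.0], "g": [1.0, 0.0], "h": [0.0, 1.0]}
weights = fit_linear(features, scores)
```

The point of the sketch is the shape of the problem: once each program part carries a numeric score, learning reduces to fitting a model against known targets, rather than repeatedly running the analysis to evaluate candidate strategies.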
