Learning a Variable-Clustering Strategy for Octagon from Labeled Data Generated by a Static Analysis

We present a method for automatically learning, from a given codebase, an effective variable-clustering strategy for the Octagon analysis. The learned strategy works as a preprocessor for Octagon: given a program to be analyzed, the strategy is applied first and partitions the program's variables into clusters. We then run a partial variant of the Octagon analysis that tracks relationships among variables within the same cluster, but not across different clusters. The notable aspect of our learning method is that, although it is based on supervised learning, it does not require manually labeled data. The method does not ask a human to indicate which pairs of program variables in the given codebase should be tracked. Instead, it uses the impact pre-analysis for Octagon from our previous work to automatically label variable pairs in the codebase as positive or negative. We implemented our method on top of a static buffer-overflow detector for C programs and tested it against open-source benchmarks. Our experiments show that the partial Octagon analysis with the learned strategy scales up to 100KLOC and is 33x faster than the analysis with the impact pre-analysis (which is itself significantly faster than the original Octagon analysis), while increasing false alarms by only 2%.
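As a rough illustration of the pipeline described above (not the authors' actual implementation), the learned strategy can be viewed as a binary classifier over variable pairs, trained on labels produced by the impact pre-analysis, whose positive predictions are then merged into clusters before the partial Octagon analysis runs. The pairwise feature extraction, the `featurize` helper, and the choice of a random-forest classifier are assumptions made for this sketch.

```python
# A minimal sketch, assuming pairwise features for variable pairs, labels
# generated by an impact pre-analysis, and a random-forest classifier.
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier


class DisjointSet:
    """Union-find used to merge positively classified pairs into clusters."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def train_strategy(pair_features, labels):
    """pair_features: feature vectors for variable pairs from the codebase;
    labels: 1 if the pre-analysis indicates the pair is worth tracking."""
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(pair_features, labels)
    return clf


def cluster_variables(clf, variables, featurize):
    """Apply the learned strategy to a new program: classify every variable
    pair and merge the pairs predicted positive into clusters."""
    ds = DisjointSet(variables)
    for x, y in combinations(variables, 2):
        if clf.predict([featurize(x, y)])[0] == 1:
            ds.union(x, y)
    clusters = {}
    for v in variables:
        clusters.setdefault(ds.find(v), []).append(v)
    return list(clusters.values())
```

The partial Octagon analysis would then introduce relational constraints only between variables that end up in the same cluster, which is what yields the reported speedup over tracking all pairs.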
