Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code

Static analysis has been used for more than a decade to find bugs in systems such as Linux. Most of the static analyses developed for these systems are simple checkers that find bugs by pattern matching. Although many sophisticated interprocedural analyses exist, few have been employed to improve checkers for systems code, because they are complex to implement and scale poorly. In this paper, we revisit the scalability problem of interprocedural static analysis from a "Big Data" perspective: we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We develop Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations. Moreover, we show that these analyses can be used to augment existing checkers; the augmented checkers uncovered 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
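To make the phrase "edge-pair centric computation of a dynamic transitive closure" concrete, here is a minimal in-memory sketch in Python. It is not Graspan's implementation (which is disk-based and parallel); it only illustrates the general idea of repeatedly matching pairs of adjacent labeled edges against grammar rules and adding the edges they produce until nothing new appears. The function name `edge_pair_closure`, the grammar encoding as a dictionary, and the edge labels are all assumptions made for this illustration.

```python
from collections import defaultdict, deque


def edge_pair_closure(edges, grammar):
    """Compute a grammar-guided transitive closure over labeled edges.

    edges   : iterable of (src, dst, label) tuples (hypothetical encoding)
    grammar : dict mapping (label1, label2) -> new_label, meaning that
              src -label1-> mid -label2-> dst induces src -new_label-> dst
    """
    closed = set(edges)
    out_edges = defaultdict(set)  # src -> {(dst, label)}
    in_edges = defaultdict(set)   # dst -> {(src, label)}
    for s, d, l in closed:
        out_edges[s].add((d, l))
        in_edges[d].add((s, l))
    worklist = deque(closed)

    def add(s, d, l):
        # Record an induced edge only if it has not been seen before.
        edge = (s, d, l)
        if edge not in closed:
            closed.add(edge)
            out_edges[s].add((d, l))
            in_edges[d].add((s, l))
            worklist.append(edge)

    while worklist:
        s, d, l = worklist.popleft()
        # Treat the edge as the first half of a pair: s -> d -> d2.
        for d2, l2 in list(out_edges[d]):
            new_label = grammar.get((l, l2))
            if new_label is not None:
                add(s, d2, new_label)
        # Treat the edge as the second half of a pair: s0 -> s -> d.
        for s0, l0 in list(in_edges[s]):
            new_label = grammar.get((l0, l))
            if new_label is not None:
                add(s0, d, new_label)
    return closed


# Plain reachability: a single label "E" with the rule E E -> E.
chain = [("a", "b", "E"), ("b", "c", "E"), ("c", "d", "E")]
reach = edge_pair_closure(chain, {("E", "E"): "E"})

# Two labels where only the pair (A, B) induces a new edge labeled C.
labeled = edge_pair_closure(
    [("x", "y", "A"), ("y", "z", "B")], {("A", "B"): "C"}
)
```

The closure is "dynamic" in the sense that edges added during the computation can themselves pair with other edges and induce further edges, which is why the sketch uses a worklist rather than a single pass.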
