Systemizing Interprocedural Static Analysis of Large-scale Systems Code with Graspan

Static analysis has been used for more than a decade to find bugs in systems such as Linux. Most existing static analyses developed for these systems are simple checkers that find bugs by pattern matching. Although many sophisticated interprocedural analyses exist, few have been employed to improve checkers for systems code because of their complex implementations and poor scalability. In this article, we revisit the scalability problem of interprocedural static analysis from a “Big Data” perspective: we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We propose Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. We develop two backends for Graspan, Graspan-C running on CPUs and Graspan-G running on GPUs, and present their designs in the article. Graspan-C can analyze large-scale systems code on any commodity PC, while Graspan-G, when GPUs are available, can achieve orders-of-magnitude speedups by harnessing a GPU’s massive parallelism. We have implemented fully context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases written in multiple languages, such as Linux and Apache Hadoop, demonstrates that their Graspan implementations are language-independent, scale to millions of lines of code, and are much simpler than their original implementations. Moreover, we show that these analyses can be used to uncover many real-world bugs in large-scale systems code.
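
To make the phrase "edge-pair centric computation of dynamic transitive closures" concrete, the following is a minimal in-memory sketch. It rests on assumptions that the abstract does not spell out: we read "edge-pair centric" as deriving a new labeled edge whenever two adjacent edges match a grammar rule, and repeating until no new edge appears. The function name `closure` and the `rules` dictionary format are hypothetical, chosen only for illustration. The sketch is single-threaded and keeps everything in memory, so it does not reflect the disk-based, parallel design of Graspan-C or Graspan-G.

```python
from collections import defaultdict, deque


def closure(edges, rules):
    """Grammar-guided transitive closure over labeled edges (illustrative only).

    edges: iterable of (src, dst, label) tuples.
    rules: dict mapping a pair of labels (first, second) to the label of the
           edge derived from two adjacent edges. This format is an assumption.
    Returns the set of all original and derived edges.
    """
    out_edges = defaultdict(set)  # src -> {(dst, label)}
    in_edges = defaultdict(set)   # dst -> {(src, label)}
    seen = set()
    work = deque()

    def add(edge):
        # Record an edge once and schedule it for pairing with its neighbours.
        if edge not in seen:
            seen.add(edge)
            src, dst, label = edge
            out_edges[src].add((dst, label))
            in_edges[dst].add((src, label))
            work.append(edge)

    for edge in edges:
        add(edge)

    while work:
        u, v, a = work.popleft()
        # Pair (u -a-> v) with each following edge (v -b-> w).
        for w, b in list(out_edges[v]):
            derived = rules.get((a, b))
            if derived is not None:
                add((u, w, derived))
        # Pair each preceding edge (t -b-> u) with (u -a-> v).
        for t, b in list(in_edges[u]):
            derived = rules.get((b, a))
            if derived is not None:
                add((t, v, derived))
    return seen


# Plain reachability: two adjacent "R" edges yield another "R" edge.
chain = [("a", "b", "R"), ("b", "c", "R"), ("c", "d", "R")]
reach = closure(chain, {("R", "R"): "R"})
print(sorted(reach))
```

Because edges are added while the computation runs, the closure is "dynamic" in the sense that each newly derived edge can itself take part in later pairings. A different `rules` table (for example, one that pairs matching call and return labels) would express a different analysis over the same loop, which is one plausible reading of why such a model can host several analyses.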


[69]  Jinha Kim,et al.  TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC , 2013, KDD.

[70]  George C. Necula,et al.  CCured: type-safe retrofitting of legacy software , 2005, TOPL.

[71]  Barbara G. Ryder,et al.  Parameterized object sensitivity for points-to analysis for Java , 2005, TSEM.

[72]  Nicola Santoro,et al.  Min-max heaps and generalized priority queues , 1986, CACM.

[73]  Laurie J. Hendren,et al.  Context-sensitive interprocedural points-to analysis in the presence of function pointers , 1994, PLDI '94.

[74]  Rajiv Gupta,et al.  Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing , 2016, USENIX Annual Technical Conference.

[75]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[76]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[77]  Daniel M. Yellin Speeding up dynamic transitive closure for bounded degree graphs , 2005, Acta Informatica.

[78]  Carlo Zaniolo,et al.  Big Data Analytics with Datalog Queries on Spark , 2016, SIGMOD Conference.

[79]  Rajeev Alur,et al.  Visibly pushdown languages , 2004, STOC '04.

[80]  Bo Wu,et al.  Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[81]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[82]  Yannis Smaragdakis,et al.  Strictly declarative specification of sophisticated points-to analyses , 2009, OOPSLA.

[83]  Hao Tang,et al.  Summary-Based Context-Sensitive Data-Dependence Analysis in Presence of Callbacks , 2015, POPL.

[84]  Vivek Sarkar,et al.  Parallel sparse flow-sensitive points-to analysis , 2018, CC.

[85]  Alexander Aiken,et al.  Verifying the Safety of User Pointer Dereferences , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[86]  Rajeev Alur,et al.  Marrying words and trees , 2007, CSR.

[87]  Weimin Zheng,et al.  Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O , 2017, USENIX Annual Technical Conference.

[88]  Yue Zhao,et al.  Towards Ontology-Based Program Analysis , 2016, ECOOP.

[89]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.

[90]  Robert E. Strom,et al.  Typestate: A programming language concept for enhancing software reliability , 1986, IEEE Transactions on Software Engineering.

[91]  Refinement-based context-sensitive points-to analysis for Java , 2006, PLDI.

[92]  Wencong Xiao,et al.  GraM: scaling graph computation to the trillions , 2015, SoCC.

[93]  Kai Wang,et al.  Grapple: A Graph System for Static Finite-State Property Checking of Large-Scale Systems Code , 2019, EuroSys.

[94]  Joseph M. Hellerstein,et al.  Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[95]  Isil Dillig,et al.  An overview of the saturn project , 2007, PASTE '07.

[96]  Willy Zwaenepoel,et al.  X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[97]  Sriram K. Rajamani,et al.  SLAM and Static Driver Verifier: Technology Transfer of Formal Methods inside Microsoft , 2004, IFM.

[98]  Cathrin Weiss,et al.  Database-Backed Program Analysis for Scalable Error Propagation , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[99]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[100]  Binyu Zang,et al.  PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs , 2019, TOPC.

[101]  Michael Philippsen,et al.  GPU-accelerated fixpoint algorithms for faster compiler analyses , 2019, CC.

[102]  Ciera Jaspan,et al.  Lessons from building static analysis tools at Google , 2018, Commun. ACM.

[103]  John D. Owens,et al.  Gunrock , 2017, ACM Trans. Parallel Comput..

[104]  Matei Zaharia,et al.  Resilient Distributed Datasets , 2016 .

[105]  Thomas W. Reps,et al.  Shape analysis as a generalized path problem , 1995, PEPM '95.

[106]  Karsten Schwan,et al.  GraphReduce: processing large-scale graphs on accelerator-based systems , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[107]  Willy Zwaenepoel,et al.  Chaos: scale-out graph processing from secondary storage , 2015, SOSP.

[108]  Atanas Rountev,et al.  Static Detection of Loop-Invariant Data Structures , 2012, ECOOP.

[109]  Uri Zwick,et al.  A fully dynamic reachability algorithm for directed graphs with an almost linear update time , 2004, STOC '04.

[110]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[111]  Manu Sridharan,et al.  Scaling CFL-Reachability-Based Points-To Analysis Using Context-Sensitive Must-Not-Alias Analysis , 2009, ECOOP.

[112]  Magdalena Balazinska,et al.  Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines , 2015, Proc. VLDB Endow..

[113]  Thomas W. Reps,et al.  Precise interprocedural dataflow analysis via graph reachability , 1995, POPL '95.

[114]  Rajiv Gupta,et al.  Synergistic Analysis of Evolving Graphs , 2016, ACM Trans. Archit. Code Optim..

[115]  Dawson R. Engler,et al.  How to Build Static Checking Systems Using Orders of Magnitude Less Code , 2016, ASPLOS.

[116]  Rong Gu,et al.  Towards Efficient Large-Scale Interprocedural Program Static Analysis on Distributed Data-Parallel Computation , 2021, IEEE Transactions on Parallel and Distributed Systems.

[117]  David A. Wagner,et al.  Finding User/Kernel Pointer Bugs with Type Inference , 2004, USENIX Security Symposium.

[118]  Dawson R. Engler,et al.  EXE: automatically generating inputs of death , 2006, CCS '06.

[119]  Yin Liu,et al.  Static analysis for inference of explicit information flow , 2008, PASTE '08.

[120]  Junfeng Yang,et al.  EXPLODE: a lightweight, general system for finding serious storage system errors , 2006, OSDI '06.

[121]  Andrea C. Arpaci-Dusseau,et al.  Error propagation analysis for file systems , 2009, PLDI '09.

[122]  Yannis Smaragdakis,et al.  Set-based pre-processing for points-to analysis , 2013, OOPSLA.

[123]  Keshav Pingali,et al.  A lightweight infrastructure for graph analytics , 2013, SOSP.

[124]  Armando Solar-Lezama,et al.  Towards optimization-safe systems: analyzing the impact of undefined behavior , 2013, SOSP.

[125]  Atanas Rountev,et al.  Demand-driven context-sensitive alias analysis for Java , 2011, ISSTA '11.

[126]  Giuseppe F. Italiano,et al.  Amortized Efficiency of a Path Retrieval Data Structure , 1986, Theor. Comput. Sci..

[127]  R. Govindarajan,et al.  Parallel flow-sensitive pointer analysis by graph-rewriting , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[128]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[129]  Dawson R. Engler,et al.  Checking system rules using system-specific, programmer-written compiler extensions , 2000, OSDI.

[130]  Ondrej Lhoták,et al.  Pick your contexts well: understanding object-sensitivity , 2011, POPL '11.

[131]  Thomas W. Reps,et al.  Interconvertibility of a class of set constraints and context-free-language reachability , 2000, Theor. Comput. Sci..

[132]  Dawson R. Engler,et al.  Bugs as deviant behavior: a general approach to inferring errors in systems code , 2001, SOSP.

[133]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[134]  Bernhard Scholz,et al.  Soufflé: On Synthesis of Program Analyzers , 2016, CAV.

[135]  Yannis Smaragdakis,et al.  Precision-guided context sensitivity for pointer analysis , 2018, Proc. ACM Program. Lang..

[136]  Hakjoo Oh,et al.  Precise and scalable points-to analysis via data-driven context tunneling , 2018, Proc. ACM Program. Lang..

[137]  Michael J. Carey,et al.  Pregelix: Big(ger) Graph Analytics on a Dataflow Engine , 2014, Proc. VLDB Endow..

[138]  Ondrej Lhoták,et al.  Scaling Java Points-to Analysis Using SPARK , 2003, CC.

[139]  Julia L. Lawall,et al.  Documenting and automating collateral evolutions in linux device drivers , 2008, Eurosys '08.