SuperC: parsing all of C by taming the preprocessor

C tools, such as source browsers, bug finders, and automated refactoring tools, need to process two languages: C itself and the preprocessor. The latter improves expressivity through file includes, macros, and static conditionals. But it operates only on tokens, which makes it hard even to parse the two languages together. This paper presents a complete, performant solution to this problem. First, a configuration-preserving preprocessor resolves includes and macros yet leaves static conditionals intact, thus preserving a program's variability. To ensure completeness, we analyze all interactions between preprocessor features and identify techniques for correctly handling them. Second, a configuration-preserving parser generates a well-formed AST with static choice nodes for conditionals. It forks new subparsers when encountering static conditionals and merges them again after the conditionals. To ensure performance, we present a simple algorithm for table-driven Fork-Merge LR parsing and four novel optimizations. We demonstrate the effectiveness of our approach on the x86 Linux kernel.
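
To make the notion of a static choice node concrete, the following is a minimal sketch in C, not SuperC's actual data structure. It assumes a source fragment such as `#ifdef CONFIG_64BIT` / `long counter;` / `#else` / `int counter;` / `#endif`, and models the resulting AST as a choice between two leaves. All names (`struct node`, `resolve`, `CONFIG_64BIT`) are hypothetical, and the guard is simplified to a single macro name, whereas real presence conditions are arbitrary Boolean expressions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical AST node: either an ordinary leaf or a static choice
 * between two subtrees, guarded by one configuration macro. */
enum node_kind { NODE_LEAF, NODE_CHOICE };

struct node {
    enum node_kind kind;
    const char *text;            /* NODE_LEAF: source text of the fragment */
    const char *macro;           /* NODE_CHOICE: name of the guarding macro */
    const struct node *if_def;   /* NODE_CHOICE: subtree when macro is defined */
    const struct node *if_undef; /* NODE_CHOICE: subtree when it is not */
};

/* Return true if `macro` appears in the NULL-terminated list `defined`. */
static bool is_defined(const char *macro, const char *const *defined)
{
    for (; *defined != NULL; defined++)
        if (strcmp(*defined, macro) == 0)
            return true;
    return false;
}

/* Project the tree onto one configuration: follow choice nodes until
 * an ordinary node (or NULL, for an empty branch) is reached. */
const struct node *resolve(const struct node *n, const char *const *defined)
{
    assert(defined != NULL);
    while (n != NULL && n->kind == NODE_CHOICE)
        n = is_defined(n->macro, defined) ? n->if_def : n->if_undef;
    return n;
}
```

A single tree of this shape retains both variants of the declaration, so a tool can inspect every configuration at once or call `resolve` to recover the tree that one particular configuration would produce.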
