Scaling Up Toward Automated Black-box Reverse Engineering of Context-Free Grammars

Black-box context-free grammar inference is a hard problem, as in many practical settings only a limited number of example programs is available. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and uses non-determinism to explore different generalization sequences. We observe that many of Arvada's generalization steps violate the nesting rules of common language concepts. We thus propose to pre-structure input programs along these nesting rules, to apply learnt rules recursively, and to make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison.
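
To make the pre-structuring idea concrete, the following is a minimal, illustrative sketch of how a flat token sequence could be grouped into a nested parse skeleton along common bracket-nesting rules before any grammar generalization takes place. The function name, token representation, and bracket set are hypothetical choices for this example and are not taken from the TreeVada implementation.

```python
# Illustrative sketch: pre-structure a flat token list into a nested
# skeleton by matching common bracket pairs. This only demonstrates the
# general idea of nesting-aware pre-structuring; details are assumptions.

BRACKETS = {"(": ")", "[": "]", "{": "}"}


def prestructure(tokens):
    """Group a flat token list into nested lists at matching brackets.

    Example: ["f", "(", "a", ",", "b", ")"] ->
             ["f", ["(", "a", ",", "b", ")"]]
    """
    stack = [[]]   # stack of partially built sibling lists, outermost first
    closers = []   # expected closing bracket for each currently open group
    for tok in tokens:
        if tok in BRACKETS:                   # open a new nested group
            stack.append([tok])
            closers.append(BRACKETS[tok])
        elif closers and tok == closers[-1]:  # close the innermost group
            group = stack.pop()
            group.append(tok)
            closers.pop()
            stack[-1].append(group)
        else:                                 # ordinary token, current level
            stack[-1].append(tok)
    while closers:                            # tolerate unbalanced input
        group = stack.pop()
        closers.pop()
        stack[-1].extend(group)
    return stack[0]


if __name__ == "__main__":
    print(prestructure(
        ["while", "(", "x", "<", "n", ")", "{", "x", "=", "x", "+", "1", "}"]))
    # ['while', ['(', 'x', '<', 'n', ')'], ['{', 'x', '=', 'x', '+', '1', '}']]
```

Starting grammar generalization from such a bracket-respecting skeleton, rather than from a flat parse tree, constrains the search so that generalization steps cannot cut across nesting boundaries.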
