Scaling Up Toward Automated Black-box Reverse Engineering of Context-Free Grammars

Black-box context-free grammar inference is a hard problem, as in many practical settings only a limited number of example programs is available. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and uses non-determinism to explore different generalization sequences. We observe that many of Arvada's generalization steps violate the nesting rules of common language concepts. We thus propose to pre-structure input programs along these nesting rules, to apply learnt rules recursively, and to make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison.
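
To make the pre-structuring idea concrete, the following is a minimal, illustrative sketch of how a flat token sequence could be grouped into a nested parse skeleton along common bracket-nesting rules before any grammar generalization takes place. The function name, token representation, and bracket set are hypothetical choices for this example and are not taken from the TreeVada implementation.

```python
# Illustrative sketch: pre-structure a flat token list into a nested
# skeleton by matching common bracket pairs. This only demonstrates the
# general idea of nesting-aware pre-structuring; details are assumptions.

BRACKETS = {"(": ")", "[": "]", "{": "}"}


def prestructure(tokens):
    """Group a flat token list into nested lists at matching brackets.

    Example: ["f", "(", "a", ",", "b", ")"] ->
             ["f", ["(", "a", ",", "b", ")"]]
    """
    stack = [[]]   # stack of partially built sibling lists, outermost first
    closers = []   # expected closing bracket for each currently open group
    for tok in tokens:
        if tok in BRACKETS:                   # open a new nested group
            stack.append([tok])
            closers.append(BRACKETS[tok])
        elif closers and tok == closers[-1]:  # close the innermost group
            group = stack.pop()
            group.append(tok)
            closers.pop()
            stack[-1].append(group)
        else:                                 # ordinary token, current level
            stack[-1].append(tok)
    while closers:                            # tolerate unbalanced input
        group = stack.pop()
        closers.pop()
        stack[-1].extend(group)
    return stack[0]


if __name__ == "__main__":
    print(prestructure(
        ["while", "(", "x", "<", "n", ")", "{", "x", "=", "x", "+", "1", "}"]))
    # ['while', ['(', 'x', '<', 'n', ')'], ['{', 'x', '=', 'x', '+', '1', '}']]
```

Starting grammar generalization from such a bracket-respecting skeleton, rather than from a flat parse tree, constrains the search so that generalization steps cannot cut across nesting boundaries.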
