Research Report: ICARUS: Understanding De Facto Formats by Way of Feathers and Wax

When a data format achieves a significant level of adoption, the presence of multiple format implementations expands the original specification in often unforeseen ways. The result is an implicitly defined, de facto format, which can create vulnerabilities in programs that handle the associated data files. In this paper we present our initial work on ICARUS: a toolchain for understanding and hardening de facto file formats. We report work in progress in the following areas: labeling and categorizing a corpus of data format samples to understand accepted variations of a format; detecting sublanguages within the de facto format using both entropy-based and taint-tracking-based methods, as a way of breaking the larger problem of learning how the grammar has evolved into tractable pieces; grammar inference via reinforcement learning, as a means of tying the learned sublanguages together; and defining both safe subsets of the de facto grammar and translations from unsafe regions of the grammar into safe ones. Real-world data formats evolve as they find use in real-world applications, and a comprehensive ICARUS toolchain for understanding and hardening the resulting de facto formats can identify and address the security risks arising from this evolution.
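
As one hedged illustration of the entropy-based direction (a minimal sketch under assumed parameters, not the ICARUS implementation): a sliding-window byte-entropy profile can flag transitions between differently encoded regions of a file, which serve as candidate sublanguage boundaries. The input file name `sample.pdf`, the window size, and the step size below are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy (bits per byte) of a byte string."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(data: bytes, window: int = 256, step: int = 64):
    """Sliding-window entropy profile over a file's bytes.

    Sharp changes in the profile suggest boundaries between regions
    encoded differently (e.g., text-like headers vs. compressed or
    binary payloads); such boundaries are candidate sublanguage
    boundaries within the de facto format.
    """
    return [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, max(len(data) - window + 1, 1), step)
    ]

if __name__ == "__main__":
    # "sample.pdf" is a hypothetical input file used only for illustration.
    with open("sample.pdf", "rb") as f:
        profile = entropy_profile(f.read())
    for offset, h in profile:
        print(f"{offset:8d}  {h:5.2f}")
```

A threshold on the change in entropy between consecutive windows is one simple way to turn this profile into a segmentation; in practice the taint-tracking signal would be used alongside it to attribute each segment to the parser code that consumes it.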
