Back in Black: Towards Formal, Black Box Analysis of Sanitizers and Filters

We tackle the problem of analyzing filter and sanitizer programs remotely, i.e., given only the ability to query the target program and observe its output. We focus on two important and widely used program classes: regular expression (RE) filters and string sanitizers. We demonstrate that the existing machine learning tools available for analyzing RE filters, namely automata learning algorithms, require a very large number of queries to infer real-life RE filters. Motivated by this, we develop the first algorithm that infers symbolic representations of automata in the standard membership/equivalence query model. We show that our algorithm provides a 15x improvement in the number of queries required to learn real-life XSS and SQL filters of popular web application firewall systems such as mod-security and PHPIDS.

Active learning algorithms require an equivalence oracle, i.e., an oracle that tests the equivalence of a hypothesis with the target machine. We show that when the goal is to audit a target filter with respect to a set of attack strings from a context-free grammar, i.e., to find an attack or infer that none exists, we can use the attack grammar to implement the equivalence oracle with a single query to the filter. Our construction finds on average 90% of the target filter states when no attack exists and is very effective in finding attacks when they are present.

For the case of string sanitizers, we show that existing algorithms for inferring sanitizers modelled as Mealy machines are not only inefficient but also lack the expressive power to infer real-life sanitizers. We design two novel extensions to existing algorithms that allow one to infer sanitizers represented as single-valued transducers. Our algorithms are able to infer many common sanitizer functions such as HTML encoders and decoders.
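To illustrate why Mealy machines fall short for sanitizers, the following minimal sketch models an HTML encoder as a one-state transducer whose transitions emit strings rather than single symbols. The table `ENCODE` and the helpers `run_transducer` and `is_mealy_compatible` are hypothetical names chosen for this example and are not taken from the paper. It assumes the textbook definition of a Mealy machine, which emits exactly one output symbol per input symbol.

```python
from typing import Dict

# Hypothetical one-state transducer: each input symbol maps to an output
# string. Symbols missing from the table are copied through unchanged.
ENCODE: Dict[str, str] = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    '"': "&quot;",
}


def run_transducer(table: Dict[str, str], text: str) -> str:
    """Apply the single-state transducer to every symbol of the input."""
    return "".join(table.get(ch, ch) for ch in text)


def is_mealy_compatible(table: Dict[str, str]) -> bool:
    """A Mealy machine emits exactly one output symbol per input symbol."""
    return all(len(out) == 1 for out in table.values())


if __name__ == "__main__":
    print(run_transducer(ENCODE, '<a href="x">'))
    print(is_mealy_compatible(ENCODE))
```

Because `"<"` alone produces the four-symbol output `&lt;`, the encoder cannot be written as a Mealy machine over the same alphabets, whereas a transducer with string-valued outputs captures it directly.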
Furthermore, we design an algorithm to convert the inferred models into BEK programs, which enables further applications such as cross-checking different sanitizer implementations and cross-compiling sanitizers into the different languages supported by the BEK backend. We showcase the power of our techniques by using our black-box inference algorithms to perform equivalence checking between different HTML encoders, including the encoders from Twitter, Facebook, and Microsoft Outlook email, for which no implementation is publicly available.
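The single-query equivalence oracle for filter auditing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's construction: the attack set is a finite list standing in for a context-free grammar, the hypothesis is a plain predicate standing in for a learned automaton, and all names (`audit_step`, `strong_filter`, `weak_hypothesis`, `ATTACKS`) are hypothetical.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

Verdict = Tuple[str, Optional[str]]


def audit_step(
    hypothesis_blocks: Callable[[str], bool],
    attack_strings: Iterable[str],
    query_filter: Callable[[str], bool],
) -> Verdict:
    """One oracle call: pick an attack string the hypothesis lets through
    and spend a single query on the real filter to classify it."""
    for s in attack_strings:
        if not hypothesis_blocks(s):
            # The only query sent to the black-box filter.
            if query_filter(s):
                # Filter blocks s but the hypothesis does not: refine the model.
                return ("counterexample", s)
            # Filter also lets s through: s is an attack that bypasses it.
            return ("attack", s)
    # Every attack string is blocked by the hypothesis: no attack found.
    return ("none", None)


# Hypothetical black-box filter (True means the input is blocked).
def strong_filter(s: str) -> bool:
    return re.search(r"<script|on\w+=", s, re.IGNORECASE) is not None


# Hypothetical, too-permissive hypothesis of that filter.
def weak_hypothesis(s: str) -> bool:
    return "<script" in s.lower()


ATTACKS = [
    "<script>alert(1)</script>",
    "<img src=x onerror=alert(1)>",
    "<svg/onload=alert(1)>",
]

if __name__ == "__main__":
    print(audit_step(weak_hypothesis, ATTACKS, strong_filter))
```

Either outcome of the single query is useful: a blocked string is a counterexample that drives the next learning round, and an unblocked string is a concrete bypass of the target filter.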
