Using Selective Memoization to Defeat Regular Expression Denial of Service

Regular expressions (regexes) are a denial of service vector in most mainstream programming languages. Recent empirical work has demonstrated that up to 10% of regexes have super-linear worst-case behavior in typical regex engines. It is therefore not surprising that many web services are reportedly vulnerable to regex denial of service (ReDoS). If the time complexity of a regex engine can be reduced transparently, ReDoS vulnerabilities can be eliminated at no cost to application developers. Unfortunately, existing ReDoS defenses — replacing the regex engine, optimizing it, or replacing regexes piecemeal — struggle with soundness and compatibility. Full memoization is sound and compatible, but its space costs are too high. No effective ReDoS defense has been adopted in practice. We present techniques to provably eliminate super-linear regex behavior with low space costs for typical regexes. We propose selective memoization schemes with varying space/time tradeoffs. We then describe an encoding scheme that leverages insights about regex engine semantics to further reduce the space cost of memoization. We also consider how to safely handle extended regex features. We implemented our proposals and evaluated them on a corpus of real-world regexes. We found that selective memoization lowers the space cost of memoization by an order of magnitude for the median regex, and that run-length encoding further lowers the space cost to constant for 90% of regexes.