Taming the Length Field in Binary Data: Calc-Regular Languages

When binary data are sent over a byte stream, the binary format used by sender and receiver is a "data serialization language", either explicitly specified or implied by the implementations. Security is at risk when sender and receiver disagree on details of this language. If, e.g., the receiver fails to reject invalid messages, an adversary may craft such invalid messages to compromise the receiver's security. Many data serialization languages are length-prefix languages: when a data item F of variable size is sent or stored, F is encoded at the binary level as a pair (|F|, F), with |F| representing the length of F (typically in bytes). This paper's main contributions and results are as follows. (1) Length-prefix languages are not context-free. This might seem to justify the conjecture that parsing those languages is difficult and inefficient. (2) The class of "calc-regular languages" is proposed, a minimalistic extension of regular languages with the additional property of handling length fields. Calc-regular languages can be specified via "calc-regular expressions", a natural extension of regular expressions. (3) Calc-regular languages are almost as easy to parse as regular languages, using finite-state machines with additional accumulators. This disproves the conjecture from (1).
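
To make the idea of a finite-state machine with an accumulator concrete, the following is a minimal sketch in Python. It is not the paper's construction: the toy format (a decimal length, a colon, the payload, a trailing comma, loosely modelled on netstrings), the function name `accepts`, and the state names are all assumptions chosen for illustration. The recognizer reads the input once, keeps a finite control state plus a single integer counter, and rejects any message whose payload does not match its declared length.

```python
def accepts(data: bytes) -> bool:
    """Accept zero or more records of the hypothetical form
    <decimal length> ':' <payload bytes> ','."""
    state = "START"  # finite control: START, LEN, PAYLOAD, END
    remaining = 0    # the single accumulator: payload bytes still expected

    for b in data:
        is_digit = 0x30 <= b <= 0x39  # ASCII '0'..'9'
        if state == "START":
            if not is_digit:
                return False
            remaining = b - 0x30
            state = "LEN"
        elif state == "LEN":
            if is_digit:
                # Accumulate the decimal length digit by digit.
                remaining = remaining * 10 + (b - 0x30)
            elif b == 0x3A:  # ':' ends the length field
                state = "PAYLOAD" if remaining > 0 else "END"
            else:
                return False
        elif state == "PAYLOAD":
            # Any byte is valid payload; only the count matters.
            remaining -= 1
            if remaining == 0:
                state = "END"
        else:  # state == "END"
            if b != 0x2C:  # ',' terminates the record
                return False
            state = "START"

    # Valid only if the input ends exactly on a record boundary.
    return state == "START"
```

For example, `accepts(b"5:hello,")` holds, whereas `accepts(b"3:hello,")` and `accepts(b"5:hell,")` do not, because the declared length disagrees with the payload. The sketch deliberately accepts leading zeros in the length field; a stricter format would need one more control state.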