A Toolkit for Modeling and Compressing Audit Data

System administrators face trade-offs concerning the volume of audit data to collect and retain. Not all approaches have easily quantified costs, but lossless compression offers an adjustable trade-off of storage for compute time. Compression techniques designed into the data format can complicate software that consumes the data, and are not adjustable to suit the needs of diverse sites. General-purpose compression tools permit some adjustment, but cannot exploit sophisticated models of the data. The toolkit described here simplifies tailoring compression tools to the properties of the data at any time after the data format is specified. Using the toolkit, a few days of work defining models can achieve compression 13% better than gzip on an existing commercial audit format, with many known properties of the data remaining to be exploited by refinements of the models for still better compression. A customized compression tool could also be designed to permit recovery of data from a compressed stream without decompressing the entire stream.
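
To make the idea of exploiting a model of the data concrete, the following minimal Python sketch may help. It is not the toolkit's interface: the record layout (a 32-bit timestamp and a 16-bit user id), the function names, and the choice of zlib as the general-purpose back end are all assumptions made for illustration. The only "model" used is that timestamps in an audit trail do not decrease, so each one can be stored as a difference from its predecessor before the general-purpose compressor runs.

```python
import struct
import zlib

# Hypothetical fixed-width audit record: 32-bit timestamp, 16-bit user id.
# This layout is an assumption for illustration, not a real audit format.
RECORD = struct.Struct(">IH")


def pack_raw(records):
    """Serialize records as-is, with absolute timestamps."""
    return b"".join(RECORD.pack(ts, uid) for ts, uid in records)


def pack_modeled(records):
    """Serialize records using a simple model: timestamps never decrease,
    so each one is stored as the difference from the previous timestamp."""
    out = []
    prev = 0
    for ts, uid in records:
        out.append(RECORD.pack(ts - prev, uid))  # requires ts >= prev
        prev = ts
    return b"".join(out)


def unpack_modeled(blob):
    """Invert pack_modeled by accumulating the stored differences."""
    records = []
    prev = 0
    for delta, uid in RECORD.iter_unpack(blob):
        prev += delta
        records.append((prev, uid))
    return records


def compressed_sizes(records, level=6):
    """Return (raw_size, modeled_size) in bytes after zlib compression.
    The level parameter trades compute time for storage."""
    raw = zlib.compress(pack_raw(records), level)
    modeled = zlib.compress(pack_modeled(records), level)
    return len(raw), len(modeled)
```

On synthetic records with regularly spaced timestamps, calling `compressed_sizes(records)` should report a smaller size for the modeled encoding than for the raw one, and `unpack_modeled(pack_modeled(records))` returns the original records, so the transformation is lossless. Real audit data has many more fields and less regular structure, so the gain from this one rule would differ from the synthetic case.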