Performance of checksums and CRCs over real data

Checksum and CRC algorithms have historically been studied under the assumption that the data fed to the algorithms was entirely random. This paper examines the behavior of checksums and CRCs over real data from various UNIX® file systems. We show that, when given real data in small to modest pieces (e.g., 48 bytes), all the checksum algorithms have skewed distributions. In one dramatic case, 0.01% of the check values appeared nearly 19% of the time. These results have implications for CRCs and checksums when applied to real data. They also cause a spectacular failure rate for both the TCP and Fletcher's checksums when trying to detect certain types of packet splices.
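
As an illustration of the kind of measurement described above, the following Python sketch computes check values over fixed-size blocks of a byte string and reports how much of the data is covered by the most frequent check values. It is a minimal sketch under stated assumptions, not the code used for the results in this paper: the function names (internet_checksum, fletcher16, top_share) are hypothetical, the TCP checksum is taken to be the usual 16-bit ones' complement sum, and Fletcher's checksum is shown in one common 16-bit, modulo-255 form.

```python
from collections import Counter


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def fletcher16(data: bytes) -> int:
    """Fletcher checksum with two 8-bit running sums, modulo 255 (assumed variant)."""
    a = b = 0
    for byte in data:
        a = (a + byte) % 255
        b = (b + a) % 255
    return (b << 8) | a


def top_share(data, checksum, block_size=48, top_fraction=0.0001, space=65536):
    """Fraction of blocks whose check value is among the most common values.

    top_fraction is the fraction of the check-value space to consider,
    e.g. 0.0001 means the most frequent 0.01% of all 2**16 values.
    """
    blocks = [
        data[i:i + block_size]
        for i in range(0, len(data) - block_size + 1, block_size)
    ]
    if not blocks:
        return 0.0
    counts = Counter(checksum(block) for block in blocks)
    k = max(1, int(space * top_fraction))
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(blocks)
```

For uniformly random input, top_share should stay close to top_fraction; a much larger value for file data would indicate the kind of skew reported here. For example, calling top_share(file_bytes, internet_checksum), where file_bytes is the content of some file read in binary mode, gives the share of 48-byte blocks that map to the most frequent 0.01% of check values.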