Steganalysis and authentication of binary document images

This dissertation addresses two important problems in document image security: steganalysis and authentication. We have developed four novel steganalysis techniques for binary document images and a robust technique for document authentication. The steganalysis techniques can be used as a countermeasure to steganography when document images are used as cover media. The first steganalysis technique uses a cubic curve model to estimate pixel positions along character or symbol boundaries; the statistics of the estimation errors are then used to detect stego images and to estimate the length of the hidden messages. The second technique uses compression bit rate as the statistic that distinguishes stego images from unmarked images. Specifically, we used the JBIG2 binary image compression algorithm to derive a quantitative relation between compression bit rate and embedding rate. The third technique detects stego images when document images degraded by print and scan noise are used as cover media; it makes use of a document degradation model for the print and scan processes. The fourth technique detects stego images when halftone images are used as cover media. We first convert a halftone image into grayscale-like images using low-pass filtering. A set of statistical features is then extracted to classify candidate images as stego or unmarked. In the document authentication method, characters and symbols are first grouped into classes by k-means clustering in the feature space, and a label is assigned to each class. The ordered sequence of labels for the characters and symbols is then used to compute a hash code for the document using a cryptographic hash function and a private key. The proposed technique tolerates noise introduced by print and scan operations but is capable of detecting intentional content alterations made to a document.
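
The authentication pipeline described above (cluster, label, hash) can be illustrated with a minimal sketch. It is not the dissertation's implementation, and several details are assumptions made for illustration: the two-dimensional toy feature vectors, the tiny deterministic k-means routine, the renaming of clusters by order of first appearance, and the use of HMAC-SHA256 as the keyed cryptographic hash. All function names (`kmeans`, `canonical`, `document_code`) are hypothetical.

```python
import hashlib
import hmac


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; returns one cluster index per point."""
    # Farthest-point initialisation (an assumption) keeps the toy example reproducible.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def canonical(labels):
    """Rename clusters by order of first appearance, so the label
    sequence does not depend on arbitrary cluster indices."""
    seen = {}
    return [seen.setdefault(lab, len(seen)) for lab in labels]


def document_code(features, k, key):
    """Keyed hash (HMAC-SHA256, an assumed choice) of the ordered label sequence."""
    labels = canonical(kmeans(features, k))
    message = ",".join(map(str, labels)).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Hypothetical features for five symbols in reading order (pattern: a b b a c).
original = [(0, 0), (5, 5), (5, 5), (0, 0), (10, 0)]
# The same symbols with small perturbations standing in for print and scan noise.
noisy = [(0.2, -0.1), (5.1, 4.8), (4.9, 5.2), (-0.1, 0.1), (9.8, 0.2)]
# The third symbol replaced by a different one (pattern: a b c a c).
tampered = [(0, 0), (5, 5), (10, 0), (0, 0), (10, 0)]

key = b"demo-key"  # placeholder secret, for illustration only
print(document_code(original, 3, key) == document_code(noisy, 3, key))
print(document_code(original, 3, key) == document_code(tampered, 3, key))
```

In this toy data the noisy copy falls into the same clusters as the original and therefore produces the same code, while the tampered copy changes the label sequence and the code no longer matches. Real character features and realistic print and scan distortions would need the more careful feature design and clustering that the dissertation develops.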