Learning document structure for retrieval and classification

In this paper, we present a method for the retrieval of document images with chosen layout characteristics. The proposed method is based on statistics of patch-codewords over different regions of image. We begin with a set of wanted and a random set of unwanted images representative of a large heterogeneous collection. We then use raw-image patches extracted from the unlabeled images to learn a codebook. To model the spatial relationships between patches, the image is recursively partitioned horizontally and vertically, and a histogram of patch-codewords is computed in each partition. The resulting set of features give a high precision and recall for the retrieval of hand-drawn and machine-print table-documents, and unconstrained mixed form-type documents, when trained using a random forest classifier. We compare our method to the spatial-pyramid method, and show that the proposed approach for learning layout characteristics is competitive for document images.