Segmentation-based retrieval of document images from diverse collections

We describe a methodology for retrieving document images from large, extremely diverse collections. First we perform content extraction, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, greylevel, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (e.g., 80-90%) for retrieval queries that seek pages containing a given fraction of a certain type of content. When the distribution of content and the error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures and vice versa. We present our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, to illustrate these conclusions. This data set is more representative than those used in previous experiments, containing a more balanced distribution of content types. The data set also contains images of text obtained from handheld digital cameras, and we discuss the success of existing methods (with no modification) in classifying these images. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
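The kind of query described above (pages containing at least a given fraction of one content type) can be illustrated with a minimal Python sketch. This is not the authors' system: the class names, the function names, the 0.35 threshold, and the toy 10-pixel "pages" are all assumptions made purely for illustration. The sketch computes per-page content fractions from per-pixel labels, answers a threshold query against both ground-truth and predicted labels, and reports per-pixel accuracy alongside retrieval precision and recall.

```python
from collections import Counter

# Hypothetical label names for the four content classes named in the abstract.
CLASSES = ("handwriting", "machine_print", "photograph", "blank")


def content_fractions(pixel_labels):
    """Fraction of a page's pixels assigned to each content class."""
    counts = Counter(pixel_labels)
    total = len(pixel_labels)
    return {c: counts.get(c, 0) / total for c in CLASSES}


def retrieve(pages, content_class, min_fraction):
    """Ids of pages whose fraction of content_class is at least min_fraction."""
    return {
        page_id
        for page_id, labels in pages.items()
        if content_fractions(labels)[content_class] >= min_fraction
    }


def pixel_accuracy(truth_pages, predicted_pages):
    """Share of pixels, over all pages, whose predicted label matches the truth."""
    correct = total = 0
    for page_id, truth in truth_pages.items():
        predicted = predicted_pages[page_id]
        correct += sum(t == p for t, p in zip(truth, predicted))
        total += len(truth)
    return correct / total


def precision_recall(retrieved, relevant):
    """Retrieval precision and recall for one query (1.0 when a set is empty)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Toy ground truth: three "pages" of 10 pixels each (illustrative data only).
truth_pages = {
    "p1": ["handwriting"] * 6 + ["blank"] * 4,
    "p2": ["handwriting"] * 2 + ["machine_print"] * 8,
    "p3": ["handwriting"] * 5 + ["photograph"] * 5,
}
# Toy classifier output: one mislabeled pixel on each page.
predicted_pages = {
    "p1": ["handwriting"] * 5 + ["blank"] * 5,
    "p2": ["handwriting"] * 3 + ["machine_print"] * 7,
    "p3": ["handwriting"] * 4 + ["photograph"] * 6,
}

# Query: pages that are at least 35% handwriting (threshold is an assumption).
relevant = retrieve(truth_pages, "handwriting", 0.35)
retrieved = retrieve(predicted_pages, "handwriting", 0.35)
accuracy = pixel_accuracy(truth_pages, predicted_pages)
precision, recall = precision_recall(retrieved, relevant)
print(f"pixel accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every page has one mislabeled pixel, yet the query result is unchanged, because the per-page fractions stay on the same side of the threshold. Real collections will not behave this cleanly; the sketch only shows why page-level retrieval can tolerate some per-pixel error.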
