The Skeleton Document Image Retrieval System
This paper describes Skeleton, a document image retrieval system whose design and implementation are based on the image analysis and document retrieval technologies of the IDUS image understanding system and the INQUERY information retrieval system, respectively. Most of the current Skeleton development effort has gone into establishing indexing and query formulation methodologies in which both words and character subsequences of words (n-grams) are tokenized. A Web-based interface has been developed for Skeleton; it includes a Java image display applet that supports word and region highlighting.

1 Overview

The Skeleton architecture supports document image analysis, indexing, and retrieval. The indexing component assumes an image analysis in which text regions are grouped into articles, head and body relations are identified, and text is recognized using conventional OCR technology. Functional role distinctions such as dateline and caption are not exploited in the current implementation, but future versions of the system will use them to extend fielded search capabilities.

Skeleton can retrieve both the OCR text output of an indexed unit of text and the page image that contains it. It is assumed that end users will normally want to retrieve page images with the appropriate regions highlighted and with pointers to the other pages of the same document. However, our experience has indicated that access to the OCR text output is useful in the search process.

The next two sections describe the image analysis process that Skeleton incorporates from the IDUS system and the basic retrieval capabilities that it inherits from the INQUERY system [1,2]. Following this discussion, Skeleton's Web-based interface is illustrated.

Skeleton's image analysis component, which was originally developed for use in the IDUS system, is illustrated in Figure 1.
The image analysis component includes ScanWorX, a commercial product developed by Xerox Information Systems, which provides segmentation and OCR support. The image analysis modules that were developed specifically for IDUS and have been incorporated into Skeleton are a logical analyzer, a document classifier, and a functional analyzer. The IDUS image understanding system from which these modules were borrowed provides a sophisticated X-windows user interface for examining image analysis output in graphical form. Although the IDUS interface is not a component of Skeleton, the ability to use IDUS to evaluate image analysis performance is an attractive option. Since Skeleton relies upon ScanWorX to perform the initial segmentation of document images, it inherits that component's …
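The indexing approach summarized in the abstract, in which both words and character subsequences of words are tokenized, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the n-gram length of 3, the `w:` and `g:` token prefixes, the function names, and the simple match-count ranking are hypothetical and do not reflect how Skeleton or INQUERY actually index or score documents.

```python
import re
from collections import defaultdict


def word_tokens(text):
    """Return lowercased alphanumeric word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(word, n=3):
    """Return contiguous character subsequences of length n.

    Words no longer than n are returned whole. n=3 is an assumed value.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(docs, n=3):
    """Map every word token and every n-gram token to the ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in word_tokens(text):
            index["w:" + w].add(doc_id)
            for g in char_ngrams(w, n):
                index["g:" + g].add(doc_id)
    return index


def search(index, query, n=3):
    """Rank documents by the number of query tokens (words and n-grams) they match."""
    scores = defaultdict(int)
    for w in word_tokens(query):
        tokens = ["w:" + w] + ["g:" + g for g in char_ngrams(w, n)]
        for tok in tokens:
            for doc_id in index.get(tok, ()):
                scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # "retrieva1" simulates an OCR error (digit 1 in place of the letter l).
    docs = {
        "a": "document retrieva1 system",
        "b": "image display applet",
    }
    index = build_index(docs)
    print(search(index, "retrieval"))
```

In this toy run the query word "retrieval" has no exact word match in document "a", yet most of its 3-grams still match the misrecognized "retrieva1", so the document is retrieved through its n-gram tokens alone. This is one plausible reason to index n-grams alongside words when the text comes from OCR.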
[1] S. M. Harding et al. An Evaluation of Information Retrieval Accuracy with Simulated OCR Output, 1992.
[2] Deborah A. Dahl et al. An intelligent document understanding system, 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).
[3] Lawrence O'Gorman et al. Document Image Analysis, 1996.