DocBrowse: a system for information retrieval from document image data

This paper presents the software architecture of DocBrowse, a system for the analysis and retrieval of mixed text/graphics document images. DocBrowse is an open and extensible environment that lets the user visually manage and query databases of highly degraded document images. It also serves as a research environment for developing document image analysis and query by image example (QBIE) algorithms. The system consists of a user interface, an object-relational document database, and a variety of document image analysis engines. Using DocBrowse, it is possible to perform queries that retrieve documents based on both graphical and textual content. We describe the graphical user interface and the visual image browser that are used to perform such queries. We also describe our approach to QBIE, the database structure, and the analysis engines incorporated in DocBrowse.
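
To make the idea of retrieval on both textual and graphical content concrete, the following is a minimal sketch, not the DocBrowse implementation. Every name in it (`DocumentRecord`, `text_score`, `image_score`, `combined_query`, the `"logo"` engine key, and the equal weighting of the two scores) is a hypothetical choice made for illustration. It assumes that each stored document carries recognized text plus a feature vector produced by some analysis engine, and that a query supplies keywords together with the feature vector of an example image.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentRecord:
    """Hypothetical stored document: recognized text plus per-engine features."""
    doc_id: str
    text: str
    features: Dict[str, List[float]] = field(default_factory=dict)


def text_score(query: str, record: DocumentRecord) -> float:
    """Fraction of query words that occur in the document text."""
    words = query.lower().split()
    if not words:
        return 0.0
    doc_words = set(record.text.lower().split())
    return sum(w in doc_words for w in words) / len(words)


def image_score(example: List[float], candidate: List[float]) -> float:
    """Map the Euclidean distance between feature vectors into (0, 1]."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(example, candidate)))
    return 1.0 / (1.0 + dist)


def combined_query(records: List[DocumentRecord], text_query: str,
                   example_features: List[float], engine: str = "logo",
                   text_weight: float = 0.5) -> List[str]:
    """Rank document ids by a weighted sum of text and image similarity."""
    ranked = []
    for record in records:
        t = text_score(text_query, record)
        feats = record.features.get(engine)
        # Documents without features from this engine get no graphical score.
        g = image_score(example_features, feats) if feats is not None else 0.0
        ranked.append((text_weight * t + (1.0 - text_weight) * g, record.doc_id))
    ranked.sort(reverse=True)
    return [doc_id for _, doc_id in ranked]


records = [
    DocumentRecord("A", "annual budget report", {"logo": [1.0, 0.0]}),
    DocumentRecord("B", "budget memo", {"logo": [0.0, 1.0]}),
    DocumentRecord("C", "meeting notes"),
]
print(combined_query(records, "budget report", [1.0, 0.0]))
```

In this toy setting, document A matches both the keywords and the example features and therefore ranks first, while document C matches neither and ranks last. A real system would replace the word-overlap and distance functions with matching that is robust to recognition errors in degraded images.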