A Generic Architecture for the Conversion of Document Collections into Semantically Annotated Digital Archives

Libraries and archives increasingly digitize document collections at scale, then process and semantically annotate them for preservation, browsing, navigation, and search. In this paper we propose a software architecture for converting high-volume document collections into semantically annotated digital libraries. The architecture recognizes two sources of knowledge in the conversion pipeline: document images and humans. The Image Analysis module and the Correction and Validation module together cover the initial conversion stage. In the former, information is extracted automatically from document images. The latter involves human intervention at a technical level to define workflows and to validate the image processing results. The second stage, represented by the Knowledge Capture modules, requires information specific to the particular knowledge domain and generally calls for expert practitioners. These two principal conversion stages are coupled with a Knowledge Management module, which provides the means to organise the extracted and acquired knowledge. In terms of data propagation, the architecture follows a bottom-up process: it starts with document image units, called terms, and progressively builds meaningful concepts and their relationships. In the second part of the paper we describe a real scenario with historical document archives, implemented according to the proposed architecture.
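
The bottom-up flow described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: every name in it (`Term`, `KnowledgeBase`, `image_analysis`, `correct_and_validate`, `capture_knowledge`, `convert`) is hypothetical, page images are stood in for by plain strings, and the human steps (technical validation and expert labelling) are modelled as callbacks passed in by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Term:
    """A document image unit (here: one token) with its current reading."""
    page: str
    text: str
    validated: bool = False


@dataclass
class KnowledgeBase:
    """Hypothetical stand-in for the Knowledge Management module."""
    terms: list[Term] = field(default_factory=list)
    concepts: dict[str, list[Term]] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def image_analysis(pages: dict[str, str]) -> list[Term]:
    # Assumption: each "page image" is already a string, and whitespace-separated
    # tokens play the role of automatically extracted image units.
    return [Term(page, tok) for page, text in pages.items() for tok in text.split()]


def correct_and_validate(terms: list[Term], fix: Callable[[str], str]) -> list[Term]:
    # The `fix` callback models the technical-level human correction step.
    return [Term(t.page, fix(t.text), validated=True) for t in terms]


def capture_knowledge(kb: KnowledgeBase, terms: list[Term],
                      label: Callable[[str], Optional[str]]) -> None:
    # The `label` callback models the domain expert assigning a concept to a term.
    by_page: dict[str, set[str]] = {}
    for t in terms:
        concept = label(t.text)
        if concept is None:
            continue
        kb.concepts.setdefault(concept, []).append(t)
        by_page.setdefault(t.page, set()).add(concept)
    # Illustrative relationship: two concepts are related if they share a page.
    for page, names in by_page.items():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                kb.relations.append((a, b, page))


def convert(pages: dict[str, str], fix: Callable[[str], str],
            label: Callable[[str], Optional[str]]) -> KnowledgeBase:
    kb = KnowledgeBase()
    kb.terms = correct_and_validate(image_analysis(pages), fix)  # stage 1
    capture_knowledge(kb, kb.terms, label)                       # stage 2
    return kb


# Made-up sample data: a "0" misread that the validator corrects to "o".
pages = {"p1": "J0an Ferrer Barcelona", "p2": "Maria Barcelona"}
kb = convert(
    pages,
    fix=lambda s: s.replace("0", "o"),
    label=lambda s: "Place" if s == "Barcelona" else "Person",
)
print(sorted(kb.concepts))
```

Passing the human contributions in as callbacks keeps the two knowledge sources (images and people) visibly separate, which is the distinction the architecture draws; a real system would replace them with interactive tools and would derive relationships from domain knowledge rather than page co-occurrence.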
