A Framework for the Encoding of Multilayered Documents

Electronic publishing of material digitized through imaging and OCR calls for a delivery format that can reconstruct the original documents in a readily usable electronic form. We present a framework for the universal encoding of multilingual image-on-text documents, which enables retrieval systems to run text searches and highlight the hits on the original page images. A generalized format for representing image-on-text allows different OCR engines and target format encoders to be integrated. The current implementation of the framework encodes multilingual content into DjVu and PDF. Performance was evaluated with a focus on file size; the results show that the overhead of adding text layers is small relative to the benefits, and that the output is comparable to that of other systems.
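To make the idea of a generalized image-on-text representation concrete, the following is a minimal sketch in Python. It is not the framework's actual format: the names `Word`, `Page`, `plain_text`, and `find` are hypothetical, and the sketch assumes only that each recognized word carries its text, a language tag, and a bounding box in page-image pixel coordinates. Under that assumption, any OCR engine could fill the structure, and any encoder (for example one targeting DjVu or PDF) could read from it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical bounding box: (left, top, right, bottom) in page-image pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Word:
    """One recognized word and its position on the page image (assumed layout)."""
    text: str
    bbox: BBox
    lang: str = "und"  # language tag; "und" stands for undetermined


@dataclass
class Page:
    """A page image plus the hidden text layer that lies beneath it."""
    image_path: str
    width: int
    height: int
    words: List[Word] = field(default_factory=list)

    def plain_text(self) -> str:
        # Text layer as a single string, e.g. for feeding a full-text index.
        return " ".join(w.text for w in self.words)

    def find(self, query: str) -> List[BBox]:
        # Case-insensitive substring match; returns the boxes to highlight
        # on the original page image.
        needle = query.casefold()
        return [w.bbox for w in self.words if needle in w.text.casefold()]


page = Page(
    image_path="page-0001.png",
    width=1000,
    height=1400,
    words=[
        Word("Image", (10, 10, 90, 40), lang="en"),
        Word("Текст", (100, 10, 180, 40), lang="ru"),
    ],
)
hits = page.find("текст")  # boxes a viewer would highlight on the scan
```

Keeping word coordinates in image space is the design choice that lets a viewer draw highlights directly on the scanned page, while the Unicode strings make the same structure usable for multilingual content.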
