Digital libraries and document image analysis

The rapid growth of digital libraries (DLs) worldwide poses many new challenges for document image analysis (DIA) research and development. DLs promise to offer more people access to larger document collections, and at far greater speed, than physical libraries can. But DLs also tend, for many reasons, to serve poorly, or even to omit entirely, many types of non-digital human-legible media, such as originally printed and handwritten documents. In their original physical (undigitized) form, these media are readily, if not always quickly, legible, searchable, and browseable. As document images accessed through DLs, however, they often lose many of these advantages while still lacking many advantages of symbolically encoded information. The author explores these issues and illustrates them with brief case studies arising from his experience as a DIA researcher in collaboration with several DL projects in the US. Difficult open DIA technical problems in DL applications are identified: in the contrasting advantages of paper and digital displays; at every stage of capture, early processing, recognition, analysis, presentation, and retrieval; and in personal and interactive applications. These problems support the conclusion that the international DIA R&D community is urgently needed, because uniquely qualified, to provide new technology to help rescue from neglect (and, in many cases, eventual oblivion) the world's vast, culturally irreplaceable legacy paper document collections.
