Digital mountain: from granite archive to global access

Large-scale, multiterabyte digital libraries are becoming feasible as the costs of storage, CPU, and bandwidth decrease. However, the cost of preparing content for input into the library remains high because of the human labor required. We describe the digital microfilm pipeline, a sequence of image processing operations used to populate a large-scale digital library from a "mountain" of microfilm while reducing the human labor involved. Essential parts of the pipeline include algorithms for document zoning and labeling, consensus-based template creation, and reversal of geometric transformations, as well as just-in-time browsing, an interactive technique for progressive access to image content over a low-bandwidth medium. We also suggest more automated approaches to cropping, enhancement, and data extraction.
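To make the idea of progressive access concrete, the following is a minimal, generic coarse-to-fine sketch. It is not the just-in-time browsing method itself, whose details are not given here. It assumes a grayscale image stored as a 2D list of integers and assumes block averaging as the way to build coarser versions; the names `downsample` and `progressive_levels` and the `factors` parameter are hypothetical, introduced only for illustration.

```python
def downsample(img, factor):
    """Average non-overlapping factor x factor blocks of a 2D grayscale image.

    Blocks at the right and bottom edges may be smaller than factor x factor.
    This averaging scheme is an assumption made for illustration.
    """
    height, width = len(img), len(img[0])
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [
                img[y][x]
                for y in range(r, min(r + factor, height))
                for x in range(c, min(c + factor, width))
            ]
            row.append(sum(block) // len(block))  # integer mean of the block
        out.append(row)
    return out


def progressive_levels(img, factors=(4, 2, 1)):
    """Yield (factor, image) pairs from the coarsest level to the finest.

    A viewer on a slow link could display each level as it arrives,
    so a rough preview appears before the full-resolution image.
    """
    for factor in factors:
        yield factor, downsample(img, factor)


if __name__ == "__main__":
    # Tiny 4x4 test image with pixel values 0..15.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for factor, level in progressive_levels(image):
        print(f"factor {factor}: {len(level)}x{len(level[0])} pixels")
```

In this sketch the coarsest level is a small fraction of the full image's size, so a usable preview can be shown early and refined as more data arrives. A practical system would more likely send only the refinement data between levels rather than each level in full.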
