Ensemble LUT classification for degraded document enhancement

The fast evolution of scanning and computing technologies has led to the creation of large collections of scanned paper documents. Examples include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations, such as legal litigation and security investigations, scanned collections are used to facilitate systematic exploration of the data. Scanned documents almost always suffer from some form of degradation. Large degradations make documents hard to read and substantially reduce the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models, which do not perform well when the degradation is large. In contrast, we propose to estimate local degradation models and use them to enhance degraded document images. Using a semi-automated enhancement system, we labeled a subset of the Frieder diaries collection. This labeled subset was then used to train an ensemble classifier. The component classifiers are based on lookup tables (LUTs) in conjunction with the approximate nearest neighbor algorithm. The resulting algorithm is highly efficient. Experimental evaluation results are provided using the Frieder diaries collection.
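The idea of an ensemble of LUT classifiers can be illustrated with a minimal sketch. It is not the authors' implementation, and every specific in it is an assumption made for illustration: images are binary lists of lists, each component classifier maps the pixel pattern inside a small window (given by hypothetical `offsets`) to the most frequent clean label seen in training, unseen patterns fall back to an exact Hamming-distance nearest neighbor (a simple stand-in for the approximate nearest neighbor search named in the abstract), and the components are combined by majority vote.

```python
from collections import Counter, defaultdict


def pattern_at(img, r, c, offsets):
    """Pixels around (r, c) at the given offsets; out-of-image pixels read as 0."""
    h, w = len(img), len(img[0])
    return tuple(
        img[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
        for dr, dc in offsets
    )


def train_lut(degraded, clean, offsets):
    """Build one LUT: local degraded pattern -> most frequent clean label."""
    votes = defaultdict(Counter)
    for r in range(len(degraded)):
        for c in range(len(degraded[0])):
            votes[pattern_at(degraded, r, c, offsets)][clean[r][c]] += 1
    return {pat: counts.most_common(1)[0][0] for pat, counts in votes.items()}


def lookup(lut, pat):
    """Return the label for pat; unseen patterns use the nearest stored pattern."""
    if pat in lut:
        return lut[pat]
    # Assumption: exact Hamming-distance search stands in for an ANN query.
    nearest = min(lut, key=lambda seen: sum(a != b for a, b in zip(seen, pat)))
    return lut[nearest]


def enhance(degraded, ensemble):
    """Label each pixel by majority vote over (offsets, lut) components."""
    out = []
    for r in range(len(degraded)):
        row = []
        for c in range(len(degraded[0])):
            ballots = Counter(
                lookup(lut, pattern_at(degraded, r, c, offsets))
                for offsets, lut in ensemble
            )
            row.append(ballots.most_common(1)[0][0])
        out.append(row)
    return out
```

In this sketch, an ensemble would be built by calling `train_lut` once per window shape on a labeled (degraded, clean) image pair and passing the resulting `(offsets, lut)` pairs to `enhance`. Using an odd number of components avoids ties in the binary vote. The table lookup is what keeps per-pixel cost low; the nearest-neighbor search is needed only for patterns absent from the training data.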
