Document cleanup using page frame detection

Abstract

When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
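The OCR cleanup step mentioned in the abstract (removing characters outside the computed page frame) can be illustrated with a minimal sketch. It assumes the page frame has already been detected and is available as an axis-aligned rectangle, and that the OCR system reports a bounding box per character. The names `Box`, `OcrChar`, and `filter_to_page_frame` are hypothetical and do not come from the paper, and the full-containment test is one possible criterion, not necessarily the one the authors used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, other: "Box") -> bool:
        # True only when `other` lies entirely inside this rectangle.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)


@dataclass(frozen=True)
class OcrChar:
    """One recognized character with its bounding box (assumed OCR output shape)."""
    text: str
    box: Box


def filter_to_page_frame(chars: List[OcrChar], frame: Box) -> List[OcrChar]:
    """Keep only the characters whose boxes fall fully inside the page frame."""
    return [c for c in chars if frame.contains(c.box)]


if __name__ == "__main__":
    frame = Box(100, 100, 900, 1200)  # assumed output of a page frame detector
    chars = [
        OcrChar("a", Box(120, 150, 140, 180)),  # inside the frame
        OcrChar("x", Box(10, 150, 30, 180)),    # marginal noise from the neighboring page
    ]
    print("".join(c.text for c in filter_to_page_frame(chars, frame)))
```

A stricter or looser criterion (for example, testing only the center of each character box) could be substituted in `contains` without changing the rest of the sketch.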
