An approach for detecting and cleaning of struck-out handwritten text

Abstract This paper addresses the identification and processing of struck-out text in unconstrained offline handwritten document images. If passed to an OCR engine, such text produces meaningless character strings. We present a method that combines (a) pattern classification and (b) graph-based analysis to identify such text. In (a), a feature-based two-class (normal vs. struck-out text) SVM classifier detects moderate-sized struck-out components. In (b), the skeleton of the text component is treated as a graph, and the strike-out stroke is identified using a constrained shortest-path algorithm. To identify zigzag or wavy strike-outs, all paths are found and properties of zigzag and wavy lines are used. Other types of strike-out stroke are detected by modifying this method. Large multi-word and multi-line struck-outs are segmented into smaller components and treated in the same way. The detected struck-out text can then be blocked from entering the OCR engine. In another kind of application, involving historical documents, page images must be generated along with their annotated ground truth. In this case the strike-out strokes can be deleted from the words, which are then fed to the OCR engine. For this purpose an inpainting-based cleaning approach is employed. We worked on 500 document pages and obtained an overall F-measure of 91.56% for English and 91.06% for Bengali script in struck-out text detection. For strike-out stroke identification, the F-measures were 89.65% (English) and 89.31% (Bengali); for stroke deletion, they were 91.16% (English) and 89.29% (Bengali).
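As an illustration of step (b), the sketch below treats a tiny binary skeleton as an 8-connected pixel graph and finds the shortest path between its leftmost and rightmost pixels as a candidate strike-out stroke. This is only a minimal sketch under stated assumptions: it uses plain Dijkstra search, whereas the abstract does not specify the constraints of the constrained shortest-path algorithm, and the function names (`skeleton_graph`, `shortest_path`, `candidate_stroke`) and the choice of end points are hypothetical.

```python
import heapq
import math


def skeleton_graph(grid):
    """Map each foreground pixel (row, col) to its 8-connected foreground neighbours."""
    pixels = {(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v}
    graph = {}
    for (r, c) in pixels:
        neighbours = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0) and (r + dr, c + dc) in pixels:
                    # Edge weight is the Euclidean step length (1 or sqrt(2)).
                    neighbours.append(((r + dr, c + dc), math.hypot(dr, dc)))
        graph[(r, c)] = neighbours
    return graph


def shortest_path(graph, source, target):
    """Plain Dijkstra search; returns the pixel path or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def candidate_stroke(grid):
    """Assumed heuristic: connect the leftmost and rightmost skeleton pixels."""
    graph = skeleton_graph(grid)
    if not graph:
        return None
    source = min(graph, key=lambda p: (p[1], p[0]))
    target = max(graph, key=lambda p: (p[1], p[0]))
    return shortest_path(graph, source, target)


# Toy skeleton: a vertical pen stroke crossed by a horizontal strike-out line.
GRID = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]
stroke = candidate_stroke(GRID)
```

On this toy input the path runs along the horizontal line and ignores the vertical stroke. A practical version would presumably add constraints such as limits on direction change, which is where the zigzag and wavy cases would need separate handling.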
