A novel framework for automatic sorting of postal documents with multi-script address blocks

Recognition of numeric postal codes in a multi-script environment is a classical problem in postal automation. In such postal documents, determining the script of the handwritten postal code is crucial for subsequently invoking the digit recognizer of the respective script. The present framework infers the script of the numeric postal code without any bias from the script of the textual part of the address block, since the two may differ in a multi-script environment. The scope of the work is the recognition of postal codes written in any of four popular scripts, viz., Latin, Devanagari, Bangla and Urdu. For this purpose, we first implement a Hough transform based technique to localize the postal-code blocks in structured postal documents with a defined address block region. Isolated handwritten digit patterns are then extracted from the localized postal-code region. In the next stage of the framework, similarly shaped digit patterns of the four scripts are grouped into 25 clusters. A script-independent unified pattern classifier is then designed to classify the digit patterns of the numeric postal codes into these 25 clusters. Based on these classification decisions, a rule-based script inference engine infers the script of the numeric postal code. One of the four script-specific classifiers is subsequently invoked to recognize the digit patterns of the corresponding script. A novel quad-tree based image partitioning technique is also developed in this work for effective feature extraction from the digit patterns. The support vector machine (SVM) based 25-class unified pattern classifier attains an average recognition accuracy of 92.03% over ten-fold cross-validation. On randomly selected six-digit numeric strings of the four scripts, an average script inference accuracy of 96.72% is achieved. The average ten-fold cross-validation recognition accuracies of the individual SVM classifiers for Latin, Devanagari, Bangla and Urdu numerals are 95.55%, 95.63%, 97.15% and 96.20%, respectively.
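
The abstract does not spell out the rules of the script inference engine, so the following is only a minimal sketch of how cluster-level decisions could be turned into a script decision by voting. The cluster-to-script table `CLUSTER_SCRIPTS`, the cluster ids, and the function `infer_script` are hypothetical names and values introduced for illustration; they are not the 25 clusters or the rules used in the actual system.

```python
from collections import Counter

# Hypothetical table: which scripts contribute digit shapes to each cluster.
# The real framework uses 25 clusters; these six entries are illustrative only.
CLUSTER_SCRIPTS = {
    0: {"Latin", "Devanagari", "Bangla", "Urdu"},  # a shape shared by all scripts
    1: {"Latin"},
    2: {"Devanagari", "Bangla"},
    3: {"Bangla"},
    4: {"Urdu"},
    5: {"Latin", "Urdu"},
}


def infer_script(cluster_ids):
    """Infer the script of a postal code from per-digit cluster labels.

    Each digit votes for every script compatible with its cluster.
    The script with the most votes wins; a tie for first place is
    reported as None (undecided). This is an assumed rule, not the
    paper's rule set.
    """
    votes = Counter()
    for cid in cluster_ids:
        # Unknown cluster ids contribute no votes.
        for script in CLUSTER_SCRIPTS.get(cid, ()):
            votes[script] += 1
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    # Six per-digit cluster labels, as produced by a unified classifier.
    print(infer_script([0, 2, 3, 0, 2, 3]))
```

Under this assumed rule, the inferred script would then select which of the four script-specific digit classifiers to invoke; an undecided result could be routed to manual sorting or to a fallback rule.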
