Extracting information from handwritten content in census forms

In this paper, we describe our approach for extracting salient information from images of US census forms. These forms present several challenges, including variations among individual form templates, skew, writing instrument, and writing style. We describe a registration algorithm that is robust to scale variations and that segments the input image into cells. Following registration, the cell borders are removed using a shape-based rule-line removal algorithm, leaving the handwritten content of each cell. Finally, each cell image is recognized using a hidden Markov model (HMM) OCR system whose language models are biased toward the type of information in the cell, such as person name, place name, number, marital status, gender, or race.
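To illustrate the idea of biasing recognition toward the type of information in a cell, the following is a minimal sketch, not the system described above. It assumes the recognizer returns a list of candidate strings with log-scores for each cell, and it combines each score with a log prior taken from a field-specific vocabulary. The names `FIELD_PRIORS` and `rescore`, the field labels, the vocabulary entries, and all probabilities are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-field vocabularies with made-up prior probabilities.
# A real system would estimate language models from training data.
FIELD_PRIORS = {
    "gender": {"M": 0.5, "F": 0.5},
    "marital_status": {"S": 0.4, "M": 0.4, "Wd": 0.15, "D": 0.05},
}


def rescore(hypotheses, field_type, lm_weight=1.0, floor=1e-6):
    """Return the hypothesis text with the best combined score.

    hypotheses: list of (text, ocr_log_score) pairs for one cell.
    field_type: key into FIELD_PRIORS naming the kind of cell.
    lm_weight:  relative weight given to the field prior.
    floor:      probability assigned to strings outside the vocabulary.
    """
    priors = FIELD_PRIORS[field_type]

    def combined(hypothesis):
        text, ocr_log_score = hypothesis
        return ocr_log_score + lm_weight * math.log(priors.get(text, floor))

    return max(hypotheses, key=combined)[0]


# The recognizer slightly prefers "N", but the field prior favors "M".
candidates = [("N", -1.0), ("M", -1.2), ("W", -1.5)]
best = rescore(candidates, "marital_status")
```

In this sketch the out-of-vocabulary candidates are penalized by the floor probability, so a candidate that is valid for the field can win even when its raw recognition score is somewhat lower.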
