Named Entity Recognition from Unstructured Handwritten Document Images

Named entity recognition is an important topic in natural language processing, but in document image processing it remains quite challenging when no linguistic knowledge is employed. In this paper, we propose an approach that detects named entities (NEs) directly from offline, unstructured handwritten document images without explicit character or word recognition and with very little aid from natural language and script rules. In the preprocessing stage, the document image is binarized and the text is segmented into words, which are then corrected for slant, skew, and baseline. After preprocessing, the words are passed on for NE recognition. We analyze the structural and positional characteristics of NEs and extract relevant features from each word image. A bidirectional long short-term memory (BLSTM) neural network is then used for NE recognition. Our system also contains a post-processing stage that reduces the rejection rate of true NEs. The proposed approach produces encouraging results on both historical and modern document images, including those from an Australian archive, which are reported here for the first time.
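
The first two preprocessing steps (binarization, then segmentation into words) can be illustrated with a minimal sketch. The abstract does not name the methods used, so everything here is an assumption: Otsu's global threshold stands in for the binarization step, a vertical projection profile stands in for word segmentation, and the function names `otsu_threshold`, `binarize`, and `segment_words` as well as the `min_gap` parameter are hypothetical. The sketch further assumes a single text line with dark ink on a light background, given as rows of 0..255 gray values.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, values in 0..255


def otsu_threshold(image: Image) -> int:
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue  # no pixels at or below t yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image: Image) -> Image:
    """Map pixels at or below the Otsu threshold to 1 (ink), others to 0."""
    t = otsu_threshold(image)
    return [[1 if value <= t else 0 for value in row] for row in image]


def segment_words(binary: Image, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split one text line into (start, end) column spans, end exclusive.

    A run of at least `min_gap` ink-free columns closes the current word.
    """
    width = len(binary[0])
    col_has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans: List[Tuple[int, int]] = []
    start, end, gap = None, 0, 0
    for x, ink in enumerate(col_has_ink):
        if ink:
            if start is None:
                start = x
            end, gap = x, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, end + 1))
                start = None
    if start is not None:
        spans.append((start, end + 1))
    return spans


# Toy line image: two dark blobs on a light background.
line = [[230] * 10 for _ in range(3)]
for row in line:
    for x in (1, 2, 6, 7, 8):
        row[x] = 20
word_spans = segment_words(binarize(line))
```

Each span in `word_spans` marks the columns of one candidate word image. In a pipeline like the one described, those crops would then go through slant, skew, and baseline correction before feature extraction; those steps are not sketched here.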
