Searching Off-line Arabic Documents

An abundance of historical manuscripts, journals, and scientific notes remains largely inaccessible in library archives. Manual transcription and publication of such documents is unlikely, and automatic transcription accurate enough to support a traditional text search is difficult. In this work we describe a lexicon-free system for performing text queries on off-line printed and handwritten Arabic documents. Our segmentation-based approach uses generalized hidden Markov models (gHMMs) with a bigram letter transition model, and kernel principal component analysis with linear discriminant analysis (KPCA/LDA) for letter discrimination. Segmentation is integrated with inference rather than performed as a separate preprocessing step. We show that our method is robust to varying letter forms, ligatures, and overlaps. Additionally, we find that ignoring letters beyond the adjoining neighbors has little effect on inference and localization, which leads to a significant performance increase over standard dynamic programming. Finally, we discuss an extension that performs batch searches of large word lists for indexing purposes.
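To make the idea of integrating segmentation with inference concrete, the following is a minimal sketch of a segmental Viterbi decoder with a bigram letter model. It is not the authors' implementation. The observation is a toy string of stroke symbols, the per-letter `templates` and the mismatch-based emission score are hypothetical stand-ins for a learned letter discriminant, and all function and parameter names (`decode`, `max_width`, `mismatch_cost`) are assumptions made for illustration. The input is assumed to be non-empty.

```python
import math


def decode(obs, templates, log_start, log_bigram, max_width=3, mismatch_cost=3.0):
    """Jointly choose segment boundaries and letter labels for `obs`.

    Returns (log_score, [(letter, start, end), ...]) for the best parse.
    """

    def emit(letter, seg):
        # Toy emission score: penalize symbol mismatches and length differences.
        tpl = templates[letter]
        diff = sum(x != y for x, y in zip(seg, tpl)) + abs(len(seg) - len(tpl))
        return -mismatch_cost * diff

    n = len(obs)
    # best[e][l] = (score, start, prev_letter) for the best parse of obs[:e]
    # whose final segment obs[start:e] is labeled with letter l.
    best = [dict() for _ in range(n + 1)]
    for e in range(1, n + 1):
        for s in range(max(0, e - max_width), e):
            seg = obs[s:e]
            for l in templates:
                em = emit(l, seg)
                if s == 0:
                    cands = [(log_start[l] + em, None)]
                else:
                    # Bigram model: only the adjoining previous letter matters.
                    cands = [(sc + log_bigram[p][l] + em, p)
                             for p, (sc, _, _) in best[s].items()]
                for sc, p in cands:
                    if l not in best[e] or sc > best[e][l][0]:
                        best[e][l] = (sc, s, p)

    # Backtrace from the best final letter.
    l = max(best[n], key=lambda k: best[n][k][0])
    score = best[n][l][0]
    e, parse = n, []
    while l is not None:
        _, s, p = best[e][l]
        parse.append((l, s, e))
        e, l = s, p
    return score, parse[::-1]


# Toy usage: three hypothetical letters with uniform start and bigram scores.
templates = {"a": "12", "b": "121", "c": "3"}
u = math.log(1.0 / 3.0)
log_start = {l: u for l in templates}
log_bigram = {p: {l: u for l in templates} for p in templates}
score, parse = decode("123121", templates, log_start, log_bigram)
```

Because the boundaries are chosen inside the same dynamic program that scores the letters, no separate segmentation pass is needed. Restricting the transition term to the immediately preceding letter is what keeps each table entry cheap to compute.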
