The Influence of Language Orthographic Characteristics on Digital Word Recognition

We study the effect of a language's orthographic characteristics on the performance of digital word recognition in degraded documents, such as historical documents. We provide a rigorous scheme for quantifying the influence of these characteristics on the quality of word recognition in such documents. We compare several orthographic characteristics across four natural languages and measure the effect of each individual characteristic on the word recognition process. To this end, we create synthetic languages that are identical in all characteristics except the one under examination, and measure the performance of two word recognition algorithms on synthetic documents written in these languages. We then summarize how the value of each characteristic influences the performance of these word recognition methods.
