Partitioning and searching dictionary for correction of optically read Devanagari character strings

Abstract. This paper describes a method for the correction of optically read Devanagari character strings using a Hindi word dictionary. The word dictionary is partitioned in order to reduce the search space besides preventing forced matching to an incorrect word. The dictionary partitioning strategy takes into account the underlying OCR process. The dictionary words at the top level have been divided into two partitions, namely: a short-words partition and the remaining words partition. The short-word partition is sub-partitioned using the envelope information of the words. The envelope consists of the number of top, lower, core modifiers along with the number of core charactersp. Devanagari characters are written in three strips. Most of the characters referred to as core characters are written in the middle strip. The remaining words are further partitioned using tags. A tag is a string of fixed length associated with each partition. The correction process uses a distance matrix for a assigning penalty for a mismatch. The distance matrix is based on the information about errors that the classification process is known to make and the confidence figure that the classification process associates with its output. An improvement of approximately 20% in recognition performance is obtained. For a short word, 590 words are searched on average from 14 sub-partitions of the short-words partition before an exact match is found. The average number of partitions and the average number of words increase to 20 and 1585, respectively, when an exact match is not found. For tag-based partitions, on an average, 100 words from 30 partitions are compared when either an exact match is found or a word within the preset threshold distance is found. If an exact match or a match within a preset threshold is not found, the average number of partitions becomes 75 and 450 words on an average are compared. To the best of our knowledge this is the first work on the use of a Hindi word dictionary for OCR post-processing.

[1]  Veena Bansal,et al.  Segmentation of Touching Characters in Devanagari , .

[2]  Edward M. Riseman,et al.  ContextualWord RecognitionUsing Binary Digrams , 1971 .

[3]  R. Mahesh K. Sinha,et al.  On partitioning a dictionary for visual text recognition , 1990, Pattern Recognit..

[4]  Sargur N. Srihari,et al.  Integrating diverse knowledge sources in text recognition , 1982, TOIS.

[5]  Godfried T. Toussaint,et al.  Experiments in Text Recognition with the Modified Viterbi Algorithm , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  R. Mahesh K. Sinha,et al.  Rule based contextual post-processing for devanagari text recognition , 1987, Pattern Recognit..

[7]  R.M.K. Sinha,et al.  A Programme for Correction of Single Spelling Errors in Hindi Words , 1984 .

[8]  Nobuyasu Itoh,et al.  A spelling correction method and its application to an OCR system , 1990, Pattern Recognit..

[9]  R. Mahesh K. Sinha,et al.  Visual text recognition through contextual processing , 1988, Pattern Recognit..

[10]  Gilles F. Houle,et al.  Hybrid Contextural Text Recognition with String Matching , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Tamotsu Kasai,et al.  A Method for the Correction of Garbled Words Based on the Levenshtein Metric , 1976, IEEE Transactions on Computers.

[12]  Ching Y. Suen,et al.  n-Gram Statistics for Natural Language Understanding and Text Processing , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  James L. Peterson,et al.  Computer programs for detecting and correcting spelling errors , 1980, CACM.

[14]  A. Lawrence Spitz An OCR based on character shape codes and lexical information , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[15]  David L. Neuhoff,et al.  The Viterbi algorithm as an aid in text recognition (Corresp.) , 1975, IEEE Trans. Inf. Theory.

[16]  Veena Bansal,et al.  Integrating knowledge sources in Devanagari text recognition system , 2000, IEEE Trans. Syst. Man Cybern. Part A.

[17]  Allen R. Hanson,et al.  A Contextual Postprocessing System for Error Correction Using Binary n-Grams , 1974, IEEE Transactions on Computers.

[18]  Theodosios Pavlidis,et al.  On the Recognition of Printed Characters of Any Font and Size , 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Horst Bunke,et al.  Handbook of Character Recognition and Document Image Analysis , 1997 .

[20]  A. Lawrence Spitz,et al.  MULTILINGUAL DOCUMENT RECOGNITION , 1997 .

[21]  Veena Bansal Integrating Knowledge Sources in Devanagari Text Recognition , 1999 .

[22]  Bidyut Baran Chaudhuri,et al.  A complete printed Bangla OCR system , 1998, Pattern Recognit..

[23]  Sargur N. Srihari,et al.  Experiments in Text Recognition with Binary n-Gram and Viterbi Algorithms , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.