Proper noun detection in document images

Abstract: An algorithm is presented for detecting proper nouns in document images printed in mixed upper and lower case. The graphical features of words in running text are analyzed to identify words that are likely to be names of specific persons, places, or objects (i.e., proper nouns). The algorithm is a useful addition to contextual post-processing (CPP) or whole-word recognition techniques, in which word images are matched to entries in a dictionary. Because a comprehensive list of proper nouns is difficult to compile, locating such words prior to recognition allows specialized recognition strategies to be applied to those words only. Experimental results show that the algorithm located about 90% of all occurrences of proper nouns and over 97% of the unique proper nouns in a document.
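
The abstract reports two different recall figures, one over all occurrences and one over unique proper nouns. The following minimal sketch illustrates how the two measures can diverge. It is not the paper's evaluation code: the function name `recall_metrics`, the position-based matching, and the toy data are all assumptions made for illustration.

```python
def recall_metrics(truth, detected_positions):
    """Compute occurrence-level and unique-word recall.

    truth: non-empty list of (position, word) pairs, one per true
           proper-noun occurrence (hypothetical ground-truth format).
    detected_positions: set of word positions flagged by a detector.
    """
    # Occurrences that the detector actually flagged.
    found = [(pos, word) for pos, word in truth if pos in detected_positions]

    # Occurrence recall: every instance in the text counts separately.
    occurrence_recall = len(found) / len(truth)

    # Unique recall: a word counts as found if any one instance was flagged.
    unique_truth = {word for _, word in truth}
    unique_found = {word for _, word in found}
    unique_recall = len(unique_found) / len(unique_truth)

    return occurrence_recall, unique_recall


# Hypothetical toy data: five occurrences of three distinct proper nouns.
truth = [(3, "Buffalo"), (10, "Hull"), (17, "Buffalo"),
         (25, "Niagara"), (31, "Hull")]
detected = {3, 10, 25, 31}  # the occurrence at position 17 is missed

occ, uniq = recall_metrics(truth, detected)
print(f"occurrence recall = {occ:.2f}, unique recall = {uniq:.2f}")
```

In this toy case one occurrence of a repeated word is missed, so occurrence recall falls below unique recall. A proper noun that appears several times only needs to be located once to count as found, which is consistent with the unique figure in the abstract being higher than the occurrence figure.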
