Can A Priori Probabilities Help in Character Recognition?

Tests indicate that use of certain cryptanalysis techniques shows promise as a method to improve character recognition by computer. All languages exhibit certain peculiar characteristics such as letter combinations, frequency of occurrence of letters and initial and terminal letters of words. These attributes can therefore be used to improve character recognition. Tests run on English text indicate that the use of statistics on the occurrence of two-letter combinations can appreciably improve character recognition in the presence of noise in the input channel. This fact leads one to believe that using this and other techniques could dramatically improve the ability of character recognition methods to filter out noisy input and improve accuracy.

There has been considerable work done recently in the field of using computers to recognize printed or typed characters [1, 2, 3]. To the authors' knowledge there has been little, if any, use of the fact that most written languages have certain letter patterns which occur often and that certain other patterns are unlikely. In fact, one observer comments that "A certain recognition technique perhaps should be coupled with other procedures to obtain truly effective, easily implemented and reliable character recognition. The loss introduced by a mixed recognition system appears to be solely one of elegance" [4]. Cryptographers, of course, have used their knowledge of letter patterns to considerable advantage in decoding secret messages. It is well known, for instance, that in English text the most frequently used letter is E and that the letter T is most often the first letter of a word. Many other facts are also known about combinations of letters, frequency of letter occurrence, etc. Could not this knowledge be used as information to a computer to allow it to more accurately read a text consisting of alphabetic characters only?
Some preliminary tests indicate that such a method would improve the accuracy of character recognition systems. Most character recognition methods are affected by "noise" from poor printing, dirt on the paper and similar conditions. The computer is then faced with the problem of deciding whether the input is due to actual data or extraneous information. The letter H might be easily distorted into an A or an R by noise in the input. In instances like these the human brain easily supplies the proper letter by using context. For instance, the word C-W must be either COW or CAW since none of the other letters in the alphabet result in proper English words. While it would no doubt be difficult to "teach" a computer enough facts to perform this type of reasoning, there is good data available on the occurrence of digraphs, or combinations of two letters, in the common languages. There are available at least two tabulations which show the occurrence in English of the various letters of the alphabet taken two at a time [5, 6].

Journal of the Association for Computing Machinery, Vol. 11, No. 4 (October, 1964), pp. 465-470. A. W. Edwards and R. L. Chambers.
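The idea of letting digraph statistics settle an uncertain letter can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names `digraph_counts` and `pick_letter`, the toy sample text, and the add-one smoothing are all assumptions made for illustration, and a real system would draw its counts from a published tabulation such as [5, 6].

```python
from collections import Counter

def digraph_counts(text):
    """Count adjacent letter pairs inside each word (letters only, uppercased)."""
    counts = Counter()
    for word in text.upper().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

def pick_letter(prev, candidates, nxt, counts):
    """Return the candidate whose digraphs with both neighbours are most frequent.

    Add-one smoothing (an assumption) keeps unseen digraphs from zeroing the score.
    """
    def score(c):
        return (counts[prev + c] + 1) * (counts[c + nxt] + 1)
    return max(candidates, key=score)

# Toy sample text, hypothetical and far too small for real use.
sample = "the hat that he has with the other thing in the hall"
counts = digraph_counts(sample)

# The recognizer cannot tell whether the middle letter of T?E is H, A or R.
best = pick_letter("T", "HAR", "E", counts)
print(best)
```

With this sample, TH and HE are common while TA, AE, TR and RE never occur, so the sketch selects H. Multiplying the two neighbour counts is only one possible scoring rule, and it ignores how confident the recognizer was in each candidate.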

[1] W. W. Bledsoe et al. Pattern recognition and reading by machine, 1959, IRE-AIEE-ACM '59 (Eastern).

[2] Franz L. Alt et al. Digital Pattern Recognition by Moments, 1962, JACM.