Fuzzy String Matching with a Deep Neural Network

ABSTRACT This work describes a deep neural network for character-level text classification. The system spots keywords in the text output of an optical character recognition (OCR) system, using memoization and an encoding of the text into feature vectors related to letter frequency. On the task of recognizing error messages in a set of generated images, dictionary- and spell-check-based approaches achieved 69% to 88% accuracy, various deep learning approaches achieved 91% to 96% accuracy, and a combination of deep learning with a dictionary achieved 97% accuracy. The contribution of this work is a new approach to character-level deep neural network classification of noisy text.
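As a rough illustration of the encoding step mentioned in the abstract, the sketch below maps a word to a vector of relative letter frequencies and memoizes the result so that repeated OCR tokens are encoded only once. The abstract does not specify the exact feature layout, so the 26-letter alphabet, the normalization, and the names `ALPHABET` and `letter_frequency_vector` are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from functools import lru_cache
import string

# Assumption: features cover only the 26 lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


@lru_cache(maxsize=None)  # memoization: each distinct token is encoded once
def letter_frequency_vector(word: str) -> tuple:
    """Return a 26-element tuple of relative letter frequencies for `word`.

    Characters outside ALPHABET (digits, punctuation, OCR debris) are ignored.
    A word with no letters maps to the all-zero vector.
    """
    letters = [c for c in word.lower() if c in ALPHABET]
    if not letters:
        return (0.0,) * len(ALPHABET)
    counts = Counter(letters)
    total = len(letters)
    return tuple(counts[c] / total for c in ALPHABET)


if __name__ == "__main__":
    clean = letter_frequency_vector("error")
    noisy = letter_frequency_vector("err0r")  # OCR misread 'o' as '0'
    print(clean[ALPHABET.index("r")], noisy[ALPHABET.index("r")])
```

Because the vector depends only on letter counts, a token with a single misread character still lands near its clean counterpart, which is the kind of noise tolerance a downstream classifier could exploit.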
