Image and video text recognition using convolutional neural networks

Thanks to increasingly powerful storage media, multimedia resources have become essential in many fields: information and broadcasting (news agencies, INA), culture (museums), transport (monitoring), environment (satellite images), and medical imaging (medical records in hospitals). The challenge is therefore to find relevant information quickly, and multimedia research is increasingly focused on indexing and retrieval techniques. Text embedded in images and videos can serve as a relevant key for this task. Recognizing such text raises many difficulties: poor resolution, characters of varying sizes, compression artifacts, anti-aliasing effects, and complex, highly variable backgrounds. Text recognition involves four steps: (1) detecting the presence of text, (2) localizing the text, (3) extracting and enhancing the text area, and (4) recognizing the content of the text. This work focuses on the last step and assumes that the text box has been correctly detected, localized, and extracted. The recognition module can itself be divided into several sub-modules, such as a binarization module, a text segmentation module, and a character recognition module. We focus on a particular machine learning model, the convolutional neural network (CNN). CNNs are neural networks whose topology is inspired by that of the mammalian visual cortex. They were first used for handwritten digit recognition and have since been applied successfully to many pattern recognition problems. In this thesis we propose a new method for binarizing text images, a new method for segmenting text images, a study of a convolutional neural network for character recognition in images, a discussion of the relevance of the binarization step in text recognition based on machine learning methods, and a new method for text recognition in images based on graph theory.
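To make the decomposition of the recognition module concrete, the following is a minimal sketch of how binarization, segmentation, and character recognition sub-modules might be chained. It does not reproduce the methods proposed in the thesis: the fixed global threshold and the column-projection segmentation are simple stand-ins, and every name here (`binarize`, `segment_columns`, `recognize_text`, `toy_classify`) is hypothetical. In particular, `toy_classify` is a placeholder for a trained character classifier such as a CNN.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def binarize(image: Image, threshold: int = 128) -> Image:
    """Stand-in binarization: mark pixels darker than a fixed threshold as text (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def segment_columns(binary: Image) -> List[Tuple[int, int]]:
    """Stand-in segmentation: cut the text box at columns that contain no text pixel.

    Returns half-open (start, end) column ranges, one per candidate character.
    """
    if not binary:
        return []
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            spans.append((start, x))       # the character ends at an empty column
            start = None
    if start is not None:                  # character touching the right border
        spans.append((start, width))
    return spans


def recognize_text(image: Image, classify: Callable[[Image], str]) -> str:
    """Chain the three sub-modules: binarize, segment, then classify each glyph."""
    binary = binarize(image)
    glyphs = [[row[s:e] for row in binary] for s, e in segment_columns(binary)]
    return "".join(classify(glyph) for glyph in glyphs)


def toy_classify(glyph: Image) -> str:
    """Placeholder classifier (assumption): labels a glyph by its width only."""
    return "I" if len(glyph[0]) == 1 else "H"


# A 3x6 toy text box: one narrow dark stroke and one wide dark stroke on white.
toy_image: Image = [[255, 0, 255, 0, 0, 255] for _ in range(3)]
result = recognize_text(toy_image, toy_classify)
```

In this sketch the classifier only ever sees binarized glyphs. Whether that step helps or hurts a learned recognizer is the question the thesis raises about the relevance of binarization.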
