Visual Saliency Models for Text Detection in Real World

This paper evaluates the saliency of text in natural scenes using visual saliency models. For this purpose, a large-scale scene image database with pixel-level ground truth is created. Using this database and five state-of-the-art models, visual saliency maps that represent the degree of saliency of objects are calculated. The receiver operating characteristic (ROC) curve is employed to evaluate the saliency of scene text as computed by each model. The distribution of text and non-text regions is visualized in the space spanned by three saliency maps, calculated with Itti's visual saliency model using intensity, color, and orientation features. This visualization indicates that text characters are more salient than their non-text neighbors and can be separated from the background; scene text can therefore be extracted from scene images. Motivated by this observation, a new visual saliency architecture, the hierarchical visual saliency model, is proposed. It builds on Itti's model and consists of two stages. In the first stage, Itti's model is used to calculate a saliency map, and Otsu's global thresholding algorithm is applied to extract the salient region of interest. In the second stage, Itti's model is applied to that salient region to calculate the final saliency map. An experimental evaluation demonstrates that the proposed model outperforms Itti's model in terms of captured scene text.
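The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the single center-surround difference here is a crude stand-in for Itti's full intensity/color/orientation model, and the function names (`saliency_map`, `otsu_threshold`, `hierarchical_saliency`) are assumptions introduced for this sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(img):
    """Crude single-feature saliency: center-surround difference of Gaussians,
    normalized to [0, 1]. A stand-in for Itti's multi-feature model."""
    center = gaussian_filter(img, sigma=2)
    surround = gaussian_filter(img, sigma=8)
    s = np.abs(center - surround)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def otsu_threshold(values):
    """Otsu's global threshold on a 256-bin histogram of values in [0, 1]:
    pick the split that maximizes between-class variance."""
    hist, edges = np.histogram(values, bins=256, range=(0.0, 1.0))
    hist = hist.astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                      # class-0 pixel counts
    cum_mean = np.cumsum(hist * np.arange(256))  # class-0 intensity sums
    mean_total = cum_mean[-1]
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum[t - 1], total - cum[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (mean_total - cum_mean[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return edges[best_t]

def hierarchical_saliency(img):
    """Two-stage hierarchical model: stage 1 computes a global saliency map
    and keeps the region above Otsu's threshold; stage 2 recomputes saliency
    restricted to that salient region."""
    s1 = saliency_map(img)
    mask = s1 >= otsu_threshold(s1)
    masked = np.where(mask, img, 0.0)  # suppress non-salient background
    s2 = saliency_map(masked)
    return s2 * mask  # final map is only defined on the salient region
```

In the paper's version, stage 1 would use Itti's complete saliency model rather than this single channel; the key design point illustrated here is that the second saliency pass operates only on the Otsu-extracted salient region, so background clutter no longer competes for saliency.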
