Text extraction from Web images based on a split-and-merge segmentation method using colour perception

This paper describes a complete approach to segmenting and extracting text from Web images for subsequent recognition, with the ultimate aim of effective indexing and presentation by non-visual means (e.g., audio). The method described here, the first in the authors' systematic effort to exploit human colour perception, extracts text in complex situations such as images in which the colour of both the characters and the background varies. More precisely, in addition to using structural features, the segmentation follows a split-and-merge strategy based on the hue-lightness-saturation (HLS) representation of colour, used as a first approximation of an anthropocentric expression of differences in chromaticity and lightness. Character-like components are then extracted and grouped into textlines, which may lie in a number of orientations or along curves.
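As a rough illustration of the split-and-merge idea in HLS space, the following minimal sketch splits an image into quadtree blocks until each block is homogeneous, then merges adjacent blocks of similar colour. It is not the authors' implementation: the tolerances (`hue_tol`, `light_tol`, `sat_floor`), the similarity rule, and all function names are assumptions chosen for illustration, and the structural features and textline extraction mentioned in the abstract are not modelled.

```python
import colorsys

def to_hls(pixel):
    """Convert an (R, G, B) pixel in 0..255 to (hue, lightness, saturation) in 0..1."""
    r, g, b = pixel
    return colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)

def hue_distance(h1, h2):
    """Hue is circular, so measure the shorter way around the colour wheel."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)

def similar(c1, c2, hue_tol=0.05, light_tol=0.1, sat_floor=0.1):
    """Hypothetical similarity rule; all three thresholds are illustrative assumptions."""
    h1, l1, s1 = c1
    h2, l2, s2 = c2
    if abs(l1 - l2) > light_tol:
        return False
    # Hue carries little meaning for near-grey colours, so compare greys by lightness only.
    if s1 < sat_floor or s2 < sat_floor:
        return s1 < sat_floor and s2 < sat_floor
    return hue_distance(h1, h2) <= hue_tol

def split(img, x, y, w, h, out):
    """Split step: recurse into quadrants until every pixel resembles the block's first pixel."""
    ref = img[y][x]
    if all(similar(ref, img[j][i]) for j in range(y, y + h) for i in range(x, x + w)):
        out.append((x, y, w, h))
        return
    hw, hh = max(w // 2, 1), max(h // 2, 1)
    quadrants = ((x, y, hw, hh), (x + hw, y, w - hw, hh),
                 (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh))
    for nx, ny, nw, nh in quadrants:
        if nw > 0 and nh > 0:
            split(img, nx, ny, nw, nh, out)

def adjacent(a, b):
    """True if two blocks (x, y, w, h) share part of an edge."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    share_v = (ax + aw == bx or bx + bw == ax) and ay < by + bh and by < ay + ah
    share_h = (ay + ah == by or by + bh == ay) and ax < bx + bw and bx < ax + aw
    return share_v or share_h

def merge(img, blocks):
    """Merge step: union adjacent blocks whose reference colours are similar."""
    parent = list(range(len(blocks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(blocks):
        for j in range(i + 1, len(blocks)):
            b = blocks[j]
            if adjacent(a, b) and similar(img[a[1]][a[0]], img[b[1]][b[0]]):
                parent[find(i)] = find(j)
    regions = {}
    for i, blk in enumerate(blocks):
        regions.setdefault(find(i), []).append(blk)
    return list(regions.values())

def segment(rgb_image):
    """Return a list of regions, each a list of (x, y, w, h) blocks."""
    img = [[to_hls(p) for p in row] for row in rgb_image]
    blocks = []
    split(img, 0, 0, len(img[0]), len(img), blocks)
    return merge(img, blocks)
```

Here `segment` takes a list of rows of `(R, G, B)` tuples. The pairwise merge loop is quadratic in the number of blocks, which is acceptable for a sketch but would need a neighbour index for real Web images.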
