Finding text in color images

In this paper, we consider the problem of locating and extracting text from WWW images. A previous algorithm based on color clustering and connected components analysis works well as long as the color of each character is relatively uniform and the typography is fairly simple. It breaks down quickly, however, when these assumptions are violated. We describe more robust techniques for dealing with this challenging problem. We present an improved color clustering algorithm that measures similarity based on proximity in both RGB color space and image space. Layout analysis is also incorporated to handle more complex typography. These changes significantly enhance the performance of our text detection procedure.
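As an illustration of clustering on both color and position, the following is a minimal sketch and not the algorithm described in the paper. The abstract does not say how the two kinds of proximity are combined, so the weighted sum in `combined_distance`, the greedy single-pass assignment in `cluster_pixels`, and the `spatial_weight` and `threshold` parameters are all assumptions made for this example.

```python
from math import sqrt


def combined_distance(p, q, spatial_weight=1.0):
    """Distance between two (r, g, b, x, y) points.

    Assumption: Euclidean RGB distance plus a weighted Euclidean
    distance between the (x, y) positions.
    """
    color = sqrt(sum((a - b) ** 2 for a, b in zip(p[:3], q[:3])))
    space = sqrt((p[3] - q[3]) ** 2 + (p[4] - q[4]) ** 2)
    return color + spatial_weight * space


def cluster_pixels(pixels, threshold=30.0, spatial_weight=1.0):
    """Greedy one-pass clustering of (r, g, b, x, y) pixels.

    Each pixel joins the nearest existing cluster whose centroid lies
    within `threshold`; otherwise it starts a new cluster. Returns one
    cluster label per input pixel.
    """
    centroids = []  # running mean of each cluster, as 5-element lists
    counts = []     # number of pixels assigned to each cluster
    labels = []
    for p in pixels:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = combined_distance(p, c, spatial_weight)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            centroids.append([float(v) for v in p])
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1
            n = counts[best]
            # Incremental update of the running mean.
            centroids[best] = [c + (v - c) / n for c, v in zip(centroids[best], p)]
            labels.append(best)
    return labels
```

With `spatial_weight=0` the sketch reduces to clustering on color alone, so two same-colored regions far apart in the image fall into one cluster. A positive weight keeps them separate, which is the behavior a combined color and spatial measure is meant to provide.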
