Text localization in spam image using edge features

Increasingly, spam emails convey their message in a human-readable image rather than in text, which makes detection by conventional content filters difficult. The text contained in spam images can nevertheless be very useful for spam detection. In this paper we propose an effective algorithm for text localization in spam images. The basic idea is to distinguish non-text edges from text edges using a set of selected edge features. Furthermore, we construct a corner detection algorithm based on a circular template to predict the corner points of the text in an image, which is crucial for text localization. Our evaluation on real-world data (spam image samples from the SpamArchive public dataset) shows that the algorithm identifies 96% of the text contained in spam images, with precision reaching up to 97.6%.
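The abstract does not describe how the circular-template corner detector works, so the following is only an illustrative sketch of the general idea, assuming a SUSAN-style test: for each pixel, count how many pixels inside a circular template have a brightness close to the centre pixel, and flag the centre as a corner candidate when that count is a small fraction of the template. The function names, the radius, the brightness tolerance `t`, and the `ratio` threshold are all hypothetical choices for illustration, not values taken from the paper.

```python
def disc_offsets(radius):
    """Return (dy, dx) offsets of all pixels inside a disc of the given radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def detect_corners(img, radius=3, t=25, ratio=0.5):
    """Flag pixels whose similar-brightness area in a circular template is small.

    img is a list of rows of grayscale values. radius, t and ratio are
    assumed parameters for this sketch, not settings from the paper.
    """
    h, w = len(img), len(img[0])
    offsets = disc_offsets(radius)
    limit = ratio * len(offsets)
    corners = []
    # Skip a border of `radius` pixels so the template always fits in the image.
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            nucleus = img[y][x]
            # Count template pixels whose brightness is within t of the centre.
            area = sum(1 for dy, dx in offsets
                       if abs(img[y + dy][x + dx] - nucleus) <= t)
            if area < limit:
                corners.append((y, x))
    return corners


# Toy input: a bright 10x10 square (rows/cols 5..14) on a dark 20x20 background.
img = [[255 if 5 <= y <= 14 and 5 <= x <= 14 else 0 for x in range(20)]
       for y in range(20)]
corners = detect_corners(img)
```

On this toy image the four corners of the square fall below the threshold, while pixels in the middle of an edge or in a flat region do not. With the loose `ratio` used here, a few pixels adjacent to each corner may also be flagged, so a practical detector would typically add non-maximum suppression or a stricter threshold.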
