Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach

We present a novel framework that uses a Bayesian approach for content-based phishing web page detection. Our model uses both textual and visual content to measure the similarity between a protected web page and suspicious web pages. We introduce a text classifier, an image classifier, and an algorithm that fuses the results of the two classifiers. A distinguishing feature of this paper is its use of a Bayesian model to estimate the matching threshold, which the classifier needs in order to decide whether a web page is phishing. In the text classifier, the naive Bayes rule is used to calculate the probability that a web page is phishing. In the image classifier, the earth mover's distance is employed to measure visual similarity, and our Bayesian model is used to determine the threshold. In the data fusion algorithm, Bayes' theory is used to combine the classification results from textual and visual content. We evaluated the proposed approach on a large-scale dataset collected from real phishing cases. Experimental results show that the text classifier and the image classifier both deliver promising results, that the fusion algorithm outperforms either individual classifier, and that our model can be adapted to different phishing cases.
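The fusion step can be illustrated with a minimal sketch. The abstract does not give the paper's fusion formula, so the code below is an assumption: it combines the two classifier posteriors with the standard product rule, which holds only if the textual and visual evidence are conditionally independent given the class. The function name `fuse_posteriors`, its parameters, and the default prior are hypothetical and chosen for illustration only.

```python
def fuse_posteriors(p_text, p_image, prior=0.5, eps=1e-9):
    """Combine two posteriors P(phishing | evidence) into one.

    Assumption (not taken from the paper): text and image evidence are
    conditionally independent given the class, so the posterior odds
    multiply and the shared prior odds are divided out once.
    """
    def odds(p):
        # Clip away from 0 and 1 so the odds stay finite.
        p = min(max(p, eps), 1.0 - eps)
        return p / (1.0 - p)

    fused_odds = odds(p_text) * odds(p_image) / odds(prior)
    return fused_odds / (1.0 + fused_odds)


if __name__ == "__main__":
    # Two classifiers that both lean towards "phishing" reinforce each other.
    print(fuse_posteriors(0.8, 0.7))
    # An uninformative image classifier leaves the text posterior unchanged.
    print(fuse_posteriors(0.8, 0.5))
```

Under this assumption, two moderately confident classifiers yield a fused probability higher than either one alone, while a classifier whose output equals the prior contributes nothing. A page would then be flagged as phishing when the fused probability exceeds a chosen decision threshold.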
