Phishing website detection using Latent Dirichlet Allocation and AdaBoost

Phishing is one of the ways criminals steal identities in cyberspace. Attackers host phishing websites that resemble legitimate websites and entice users to click on hyperlinks that direct them to these fake sites. Attackers use the fake sites to capture personal information such as logins, passwords, and social security numbers from unsuspecting victims, and later use that information to commit crimes. We propose a robust methodology for detecting phishing websites that uses a topic modeling technique, Latent Dirichlet Allocation, for semantic analysis and AdaBoost for classification. The methodology is a content-driven approach that is device independent and language neutral. An intelligent web crawler collects the website content served to mobile and desktop clients. Website content that is not in English is translated to English using Google's language translator. A topic model is built from the translated content of the desktop and mobile clients. The phishing website classifier is built using (i) the topic distribution probabilities found by Latent Dirichlet Allocation as features and (ii) the AdaBoost voting technique. Experiments were conducted on a large public corpus of website data containing 47,500 phishing websites and 52,500 legitimate websites. Results show that our method achieves an F-measure of 99%.
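To make the classification step concrete, the following is a minimal sketch of AdaBoost voting over per-page topic proportions, written in plain Python. It is not the authors' implementation. The six three-topic vectors are hypothetical toy values standing in for the output of a fitted Latent Dirichlet Allocation model, and the use of one-feature threshold rules (decision stumps) as the weak learner is an assumption, since the abstract does not name the weak learner.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Weak learner: vote +1/-1 from a single topic proportion."""
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        values = sorted(set(row[feature] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for polarity in (1, -1):
                err = sum(
                    wi for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost_fit(X, y, rounds=5):
    """Discrete AdaBoost: reweight examples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(X, y, w)
        if err >= 0.5:  # no better than chance; stop adding learners
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified pages, decrease the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps: +1 = phishing, -1 = legitimate."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical topic proportions (each row sums to 1), NOT data from the paper.
# Topic 0 is imagined as a "login / verify account" topic.
X_train = [
    [0.7, 0.2, 0.1], [0.6, 0.1, 0.3], [0.8, 0.1, 0.1],  # phishing pages
    [0.1, 0.6, 0.3], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7],  # legitimate pages
]
y_train = [1, 1, 1, -1, -1, -1]

model = adaboost_fit(X_train, y_train)
train_predictions = [adaboost_predict(model, row) for row in X_train]
suspicious = adaboost_predict(model, [0.65, 0.25, 0.10])
benign = adaboost_predict(model, [0.15, 0.50, 0.35])
```

In this toy setting a single stump on topic 0 already separates the classes, so the boosting rounds add little. On real topic distributions the reweighting step is what lets later weak learners concentrate on the pages earlier ones misclassified.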
