Examination of data, rule generation and detection of phishing URLs using online logistic regression

Web services such as online banking, gaming, and social networking have evolved rapidly, as has people's reliance on them for everyday tasks. As a result, a large amount of information is uploaded to the Web every day. The openness of the Web gives criminals opportunities to upload malicious content. Despite extensive research, email-based spam filtering techniques are unable to protect other web services. A countermeasure is therefore needed that generalizes across web services to protect users from phishing hosts. This paper describes an approach that automatically classifies URLs based on their lexical and host-based features. It demonstrates the usability of Mahout for such scalable machine learning problems and considers online learning in preference to batch learning. The classifier achieves 93-97% accuracy, detecting a large number of phishing hosts while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed and supported using the chi-squared and information gain attribute evaluation methods.
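
The approach summarized above, online logistic regression over lexical URL features, can be illustrated with a minimal sketch. This is not the paper's Mahout implementation: it is plain Python using only the standard library, and the names `lexical_features` and `OnlineLogisticRegression`, the learning rate, the adjacent-token definition of a bigram, and the toy URLs are all assumptions made for illustration. Host-based features are omitted.

```python
import math
import re


def lexical_features(url):
    """Map a URL to sparse binary features: a bias, its tokens, and
    adjacent-token bigrams (an assumed definition of "bigram")."""
    feats = {"bias": 1.0}
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    for tok in tokens:
        feats["tok=" + tok] = 1.0
    for a, b in zip(tokens, tokens[1:]):
        feats["bi=" + a + "_" + b] = 1.0
    return feats


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate  # assumed value, not taken from the paper
        self.w = {}              # sparse weight vector, keyed by feature name

    def predict_proba(self, feats):
        z = sum(self.w.get(k, 0.0) * v for k, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clip to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, label):
        """One online step: predict first, then move weights toward the label."""
        p = self.predict_proba(feats)
        err = label - p  # gradient of the log-likelihood w.r.t. z
        for k, v in feats.items():
            self.w[k] = self.w.get(k, 0.0) + self.lr * err * v
        return p


# Made-up toy data for illustration only (1 = phishing, 0 = benign).
toy_data = [
    ("http://paypal.com.secure-login.example-verify.net/update/account", 1),
    ("https://www.example.com/news/sports", 0),
    ("http://192.0.2.10/signin/bank/confirm.php", 1),
    ("https://docs.example.org/guide/install", 0),
    ("http://secure-login.account-verify.example.org/webscr", 1),
    ("https://www.example.com/about/team", 0),
]

model = OnlineLogisticRegression()
for _ in range(20):  # several passes over the tiny stream
    for url, label in toy_data:
        model.update(lexical_features(url), label)

for url, label in toy_data:
    print(label, round(model.predict_proba(lexical_features(url)), 3), url)
```

Because each update touches only the features present in one URL, the model can be trained on a stream without holding the whole dataset in memory, which is the practical appeal of online over batch learning for this kind of problem.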
