MALURLS: A Lightweight Malicious Website Classification Based on URL Features

Surfing the World Wide Web (WWW) is becoming a dangerous everyday task as the Web grows rich in all sorts of attacks. Websites are a major source of scams, phishing attacks, identity theft, spam commerce and malware. Browsers, blacklists and popup blockers are not enough to protect users; protection requires fast and accurate systems able to detect new malicious content. We propose MALURLs, a lightweight system that detects malicious websites online based on URL lexical and host features. The system relies on a Naive Bayes classifier as a probabilistic model to decide whether the target website is malicious or benign. It introduces new features and employs self-learning using a Genetic Algorithm (GA) to improve classification speed and precision. A small dataset is collected and expanded through GA mutations to train the system in a short time and with low memory usage. A completely independent testing dataset is automatically gathered and verified using different trusted web sources. The algorithm achieves an average precision of 87%.
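To make the classification step concrete, the following is a minimal sketch of a Naive Bayes classifier over binary URL lexical features. It is not the MALURLs implementation: the five features, the length threshold of 54, the class `BernoulliNaiveBayes`, the helper `lexical_features`, and the six toy training URLs (all on reserved example domains and documentation IP ranges) are assumptions chosen for illustration, and the abstract's host features and GA-based learning are not modeled.

```python
import math
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Map a URL to a tuple of binary lexical features (all hypothetical)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return (
        len(url) > 54,                                       # unusually long URL (assumed threshold)
        host.count(".") > 3,                                 # many subdomain levels
        bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),  # host is a raw IPv4 address
        "-" in host,                                         # hyphenated host name
        "@" in url,                                          # userinfo part in the URL
    )


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        n_features = len(rows[0])
        self.log_prior = {}
        self.p_true = {}
        for c in self.classes:
            members = [r for r, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(rows))
            # Smoothed estimate of P(feature_j is true | class c).
            self.p_true[c] = [
                (sum(r[j] for r in members) + 1) / (len(members) + 2)
                for j in range(n_features)
            ]
        return self

    def log_scores(self, row):
        """Return the unnormalized log posterior of each class for one row."""
        scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            for value, p in zip(row, self.p_true[c]):
                s += math.log(p if value else 1.0 - p)
            scores[c] = s
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy training set, invented for this sketch; not data from the paper.
TRAIN = [
    ("http://192.0.2.10/login/update/account/verify.php?id=12345678", "malicious"),
    ("http://secure-login.example-bank.accounts.verify.example.net/session?token=abcdef0123456789", "malicious"),
    ("http://user@203.0.113.7/paypal-update", "malicious"),
    ("https://www.example.org/", "benign"),
    ("https://docs.example.com/guide", "benign"),
    ("https://example.edu/about", "benign"),
]

model = BernoulliNaiveBayes().fit(
    [lexical_features(url) for url, _ in TRAIN],
    [label for _, label in TRAIN],
)


def classify(url):
    """Classify a single URL with the toy model above."""
    return model.predict(lexical_features(url))


if __name__ == "__main__":
    for url in (
        "http://198.51.100.23/account/verify/confirm/signin.php?session=0123456789",
        "https://www.example.com/index.html",
    ):
        print(url, "->", classify(url))
```

Binary features keep the model to one smoothed probability per feature and class, so training and prediction are both linear in the number of features, which fits the lightweight goal stated in the abstract. A real system would need a far larger labeled dataset and validated features.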
