What's in a URL: Fast Feature Extraction and Malicious URL Detection

Phishing is an online social engineering attack whose goal is digital identity theft, carried out by pretending to be a legitimate entity. The attacker sends an attack vector, commonly in the form of an email, chat session, or blog post, which contains a link (URL) to a malicious website set up to elicit private information from victims. We focus on building a system for URL analysis and classification, primarily to detect phishing attacks. URL analysis is attractive because it keeps the victim at a distance from the attacker: features are taken from the URL itself rather than by visiting the website. It is also faster than the alternatives used in previous research, namely Internet searches, retrieval of content from the destination website, and network-level features. We investigate several facets of URL analysis, e.g., performance on both balanced and unbalanced datasets, in static as well as live experimental setups, and online versus batch learning.
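To illustrate what extracting features from the URL alone can look like, here is a minimal sketch in Python. The function names and the particular features (lengths, dot and digit counts, an IPv4-host check, character n-grams) are illustrative assumptions, not the feature set used in this work. The point is only that every value is computed from the URL string, with no network request to the destination.

```python
from urllib.parse import urlsplit


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_features(url):
    """Compute a few string-only features of a URL (hypothetical feature set)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # An '@' in a URL can hide the real host after fake credentials.
        "has_at_symbol": "@" in url,
        # Crude check for a dotted-quad IPv4 host such as 192.168.0.1.
        "host_is_ipv4": host.count(".") == 3 and host.replace(".", "").isdigit(),
        "path_depth": parts.path.count("/"),
    }


if __name__ == "__main__":
    example = "http://192.168.0.1/login/verify.php?id=42"
    print(lexical_features(example))
    print(char_ngrams("login", 3))
```

A feature dictionary like this could be fed to either a batch or an online classifier. Which features are actually discriminative is an empirical question that this sketch does not answer.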