Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
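As a rough illustration of the kind of classifier described above, the following Python sketch turns each URL into sparse lexical features and fits a logistic regression model by stochastic gradient ascent. It is a minimal sketch under stated assumptions, not the authors' implementation: the tokenization rules, the two numeric features, the toy URLs and their labels, and the function names (`lexical_features`, `predict_proba`, `train`) are all hypothetical. Host-based features, which would require lookups such as WHOIS or DNS, are omitted.

```python
import math
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Map a URL to a sparse dict of lexical features (illustrative choice)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    path = parsed.path or ""
    feats = {"bias": 1.0}
    # Bag-of-words over hostname tokens (split on '.').
    for tok in host.split("."):
        if tok:
            feats["host=" + tok] = 1.0
    # Bag-of-words over path tokens (split on common URL delimiters).
    for tok in re.split(r"[/?.=\-_&]+", path):
        if tok:
            feats["path=" + tok.lower()] = 1.0
    # Two simple numeric features, scaled down to stay roughly below 1.
    feats["url_len"] = len(url) / 100.0
    feats["host_dots"] = host.count(".") / 10.0
    return feats


def predict_proba(weights, feats):
    """Logistic model: estimated probability that the URL is malicious."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=50, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for url, label in examples:
            feats = lexical_features(url)
            err = label - predict_proba(weights, feats)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + lr * err * value
    return weights


# Hypothetical toy data on reserved domains: label 1 = malicious, 0 = benign.
TRAINING_URLS = [
    ("http://www.example.com/index.html", 0),
    ("http://docs.example.org/guide/intro", 0),
    ("http://example.net/about", 0),
    ("http://paypal.login.example-bad.test/secure/update/account", 1),
    ("http://10.0.0.1.example-bad.test/ebayisapi/signin", 1),
    ("http://bank.verify.example-bad.test/login/confirm", 1),
]

weights = train(TRAINING_URLS)
unseen = "http://account.example-bad.test/login/verify"
print(unseen, predict_proba(weights, lexical_features(unseen)))
```

Because the feature vectors are stored as dictionaries, the model only touches the features present in each URL, which is what makes tens of thousands of mostly-absent token features practical. A real system would use far more data, regularization, and held-out evaluation.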
