Suspicious URL Filtering Based on Logistic Regression with Multi-view Analysis

Current malicious-URL detection techniques that rely on the whole URL string have difficulty detecting obfuscated malicious URLs. The most precise way to identify a malicious URL is to verify the content of the corresponding web page, but doing so is costly in time, traffic, and computing resources. In practice, therefore, a filtering step is needed to select the more suspicious URLs for further verification. In this work, we propose a suspicious URL filtering approach based on multi-view analysis, which aims to reduce the impact of URL obfuscation techniques. A URL is composed of several portions, each with a specific use. The proposed method learns the characteristics of multiple portions (views) of a URL and assigns a suspicion level to each portion. By adjusting the suspicion threshold of each portion, the system selects the most suspicious URLs. We evaluate the system on a real dataset from T. Co. The requirements set by T. Co. are that (1) the detection rate be less than 25%, (2) the missing rate be lower than 25%, and (3) one hour of data be processed within an hour. The experimental results show that our approach is effective: it retains more malicious URLs among the selected suspicious ones and satisfies the requirements of a practical environment such as the daily operations at T. Co.
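The per-portion scoring and thresholding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-way split into domain, path, and query, the toy lexical features, the hand-set weights in `MODELS`, the values in `THRESHOLDS`, and the rule that a URL is flagged when any view reaches its threshold are all assumptions made for illustration. In a real system the weights would be learned by logistic regression on labeled data. The sketch also assumes that "detection rate" means the fraction of all URLs flagged as suspicious and "missing rate" means the fraction of malicious URLs not flagged.

```python
import math
from urllib.parse import urlsplit


def split_views(url):
    """Split a URL into portions ("views"). This three-way split is an assumption."""
    parts = urlsplit(url)
    return {"domain": parts.netloc, "path": parts.path, "query": parts.query}


def features(text):
    """Toy lexical features: scaled length, digit count, special-character count."""
    return [
        len(text) / 10.0,
        sum(c.isdigit() for c in text),
        sum(c in "-_%=&@" for c in text),
    ]


def suspicion(weights, bias, x):
    """Logistic-regression score in (0, 1) for one view."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical per-view (weights, bias); real values would be learned from data.
MODELS = {
    "domain": ([0.1, 0.8, 0.5], -3.0),
    "path": ([0.1, 0.3, 0.4], -3.0),
    "query": ([0.1, 0.2, 0.6], -3.0),
}

# Hypothetical per-view suspicion thresholds; these are the knobs to adjust.
THRESHOLDS = {"domain": 0.5, "path": 0.5, "query": 0.5}


def is_suspicious(url):
    """Flag a URL if any view reaches its threshold (an assumed combination rule)."""
    for name, text in split_views(url).items():
        weights, bias = MODELS[name]
        if suspicion(weights, bias, features(text)) >= THRESHOLDS[name]:
            return True
    return False


def rates(urls, labels):
    """Return (detection rate, missing rate) under the definitions assumed above.

    labels[i] is 1 for a known-malicious URL and 0 otherwise.
    """
    flagged = [is_suspicious(u) for u in urls]
    detection_rate = sum(flagged) / len(urls)
    malicious = [f for f, y in zip(flagged, labels) if y == 1]
    missing_rate = (
        sum(1 for f in malicious if not f) / len(malicious) if malicious else 0.0
    )
    return detection_rate, missing_rate
```

Raising a view's threshold flags fewer URLs (lower detection rate) at the risk of missing more malicious ones, so the thresholds would be tuned until both rates meet targets such as the 25% limits stated above.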
