Automating URL Blacklist Generation with Similarity Search Approach

Web users browsing the Internet may encounter a browser security threat known as the drive-by-download attack. Drive-by-download attacks use exploit code to take control of a user's web browser. Many web users do not take such underlying threats into account when clicking URLs. A URL blacklist is one of the practical approaches to thwarting browser-targeted attacks. However, a URL blacklist cannot cope with previously unseen malicious URLs. Therefore, to keep a URL blacklist effective, it is crucial to keep its URLs up to date. Given these observations, we propose a framework called automatic blacklist generator (AutoBLG) that automates the collection of new malicious URLs, starting from a given existing URL blacklist. The primary mechanism of AutoBLG is to expand the search space of web pages while reducing the number of URLs to be analyzed; it does so by applying several pre-filters, such as similarity search, which accelerates the process of generating blacklists. AutoBLG consists of three primary components: URL expansion, URL filtration, and URL verification. Through extensive analysis using a high-performance web client honeypot, we demonstrate that AutoBLG can successfully discover new and previously unknown drive-by-download URLs from the vast web space.

key words: drive-by-download, URL blacklist, search space, machine learning, web client honeypot
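The three components named in the abstract (URL expansion, URL filtration, URL verification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from AutoBLG itself: the function names `tokenize`, `jaccard`, and `generate_blacklist` are hypothetical, the `expand` and `verify` callables stand in for whatever crawling step and client honeypot a real system would use, and Jaccard similarity over URL tokens is a simple stand-in for the similarity search pre-filter, whose actual method the abstract does not specify.

```python
import re
from typing import Callable, Iterable, List, Set


def tokenize(url: str) -> Set[str]:
    """Split a URL into a set of lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_blacklist(
    seeds: List[str],
    expand: Callable[[str], Iterable[str]],  # hypothetical: seed URL -> candidate URLs
    verify: Callable[[str], bool],           # hypothetical: stand-in for a honeypot check
    threshold: float = 0.5,                  # assumed cut-off, not from the source
) -> List[str]:
    """Return newly found URLs that pass both the pre-filter and verification."""
    known = set(seeds)
    seed_tokens = [tokenize(s) for s in seeds]

    # 1. URL expansion: widen the search space around the existing blacklist.
    candidates = {u for s in seeds for u in expand(s)} - known

    # 2. URL filtration: keep only candidates similar to some known-bad URL,
    #    so that far fewer URLs reach the expensive verification step.
    suspicious = [
        u
        for u in sorted(candidates)
        if max((jaccard(tokenize(u), st) for st in seed_tokens), default=0.0)
        >= threshold
    ]

    # 3. URL verification: confirm each remaining candidate.
    return [u for u in suspicious if verify(u)]
```

The point of the ordering is cost: expansion is allowed to produce many candidates because the cheap similarity pre-filter discards most of them before the costly verification step runs.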
