CURLA: Cloud-Based Spam URL Analyzer for Very Large Datasets

URL blacklisting is a widely used technique for blocking phishing websites. To prepare an effective blacklist, it is necessary to analyze possible threats and include the identified malicious sites in the blacklist. Spam emails are a good source of suspected phishing websites. However, the number of URLs gathered from spam emails is quite large, and fetching and analyzing the content of so many websites is very expensive given limited computing and storage resources. Moreover, a high percentage of URLs extracted from spam emails refer to the same website, so preserving the contents of every fetched URL wastes a significant amount of storage. To solve the problem of massive computing and storage resource requirements, we propose and develop CURLA, a Cloud-based spam URL Analyzer built on top of Amazon Elastic Compute Cloud (EC2) and Amazon Simple Queue Service (SQS). CURLA processes a large number of spam-based URLs in parallel at a lower cost than establishing an equally capable local infrastructure. Our system builds a database of unique spam-based URLs and accumulates the content of these unique websites in a central repository, which can later be used to detect phishing or other counterfeit websites. We show the effectiveness of our proposed architecture using real-life spam-based URL data.
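The two ideas in the abstract, collapsing URLs that point to the same website and fanning the remaining work out to parallel workers, can be illustrated with a small sketch. This is not CURLA's implementation: the abstract does not say how uniqueness is decided, so the sketch assumes that two URLs with the same host belong to the same website, and it uses a local in-process queue as a stand-in for SQS and threads as a stand-in for EC2 workers. The names `site_key`, `unique_sites`, `process_in_parallel`, and the `fetch` callback are all hypothetical.

```python
import hashlib
import queue
import threading
from urllib.parse import urlsplit


def site_key(url):
    """Reduce a URL to its host (assumption: same host means same website)."""
    host = urlsplit(url.strip()).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def unique_sites(urls):
    """Keep the first URL seen for each site key, preserving input order."""
    seen = {}
    for url in urls:
        key = site_key(url)
        if key and key not in seen:
            seen[key] = url
    return seen


def process_in_parallel(urls, fetch, workers=4):
    """Fetch URLs with several workers; store each distinct page body once.

    A local queue.Queue stands in for a cloud message queue here.
    `fetch` is a caller-supplied function mapping a URL to bytes.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    store = {}  # content digest -> page body (the "central repository")
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            body = fetch(url)
            digest = hashlib.sha256(body).hexdigest()
            with lock:
                store.setdefault(digest, body)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

For example, calling `unique_sites` on a list of spam URLs and passing the resulting values to `process_in_parallel` with a real HTTP fetch function would download each site once and keep one copy of each distinct page. A production system would need a more careful notion of "same website" (redirects, shared hosting, URL shorteners) than the host comparison assumed here.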
