PhishMon: A Machine Learning Framework for Detecting Phishing Webpages

Despite numerous research efforts, phishing attacks remain prevalent and highly effective in luring unsuspecting users into revealing sensitive information, including account credentials and social security numbers. In this paper, we propose PhishMon, a new feature-rich machine learning framework to detect phishing webpages. It relies on a set of fifteen novel features that can be efficiently computed from a webpage without requiring third-party services such as search engines or WHOIS servers. These features capture various characteristics of legitimate web applications as well as their underlying web infrastructures. Emulating these features is costly for phishers, as it requires them to spend significantly more time and effort on their underlying infrastructures and web applications, in addition to the effort required to replicate the appearance of target websites. Through extensive evaluation on a dataset consisting of 4,800 distinct phishing and 17,500 distinct benign webpages, we show that PhishMon can distinguish unseen phishing webpages from legitimate ones with a very high degree of accuracy. In our experiments, PhishMon achieved 95.4% accuracy with a 1.3% false positive rate on a dataset containing unique phishing instances.
