PhishZip: A New Compression-based Algorithm for Detecting Phishing Websites

Phishing has grown significantly in the past few years and is predicted to increase further. Because attacks keep changing, it is difficult to build a robust phishing detection system and to select features that still characterise phishing as attack patterns shift. In this study, we propose PhishZip, a novel phishing detection approach that uses a compression algorithm to classify websites, and we demonstrate a systematic way to construct the word dictionaries for the compression models using word occurrence likelihood analysis. PhishZip achieves a true positive rate of 80.04%, outperforming the best-performing HTML-based features from past studies. We also propose the compression ratio as a novel machine learning feature, which significantly improves machine-learning-based phishing detection over previous studies. With compression ratios as additional features, the true positive rate improves by 30.30 percentage points (from 51.47% to 81.77%), and the accuracy increases by 11.84 percentage points (from 71.20% to 83.04%).
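
The core idea, compressing a page against class-specific word dictionaries and comparing how well each one fits, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses the preset-dictionary option (`zdict`) of Python's `zlib` module, the two dictionaries are tiny made-up word lists rather than ones derived by word occurrence likelihood analysis, the ratio is taken as compressed size divided by original size, and the decision rule (the dictionary giving the smaller ratio wins) is a hypothetical stand-in for whatever rule PhishZip applies.

```python
import zlib

# Hypothetical word dictionaries, invented for this sketch. In the paper the
# dictionaries are built with word occurrence likelihood analysis.
PHISH_DICT = b"verify your account password login suspended confirm identity"
LEGIT_DICT = b"weather recipes sports news travel gallery contact about"


def compression_ratio(text: str, dictionary: bytes) -> float:
    """Return compressed size / original size (assumed definition of the ratio).

    A smaller value means the preset dictionary matched more of the text.
    """
    data = text.encode("utf-8")
    if not data:
        raise ValueError("text must not be empty")
    # zdict primes the DEFLATE window with the dictionary words.
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    compressed = compressor.compress(data) + compressor.flush()
    return len(compressed) / len(data)


def compression_features(text: str) -> list[float]:
    """Ratios under both dictionaries, usable as extra features for a classifier."""
    return [
        compression_ratio(text, PHISH_DICT),
        compression_ratio(text, LEGIT_DICT),
    ]


def classify(text: str) -> str:
    """Assumed decision rule: the dictionary that compresses the text better wins."""
    phish_ratio, legit_ratio = compression_features(text)
    return "phishing" if phish_ratio < legit_ratio else "legitimate"


if __name__ == "__main__":
    page_a = ("Your account is suspended. Login and confirm your password "
              "to verify identity.")
    page_b = "Today's weather, sports news, travel gallery and recipes."
    for page in (page_a, page_b):
        print(classify(page), compression_features(page))
```

The `compression_features` helper mirrors the second contribution of the abstract: rather than thresholding the ratios directly, they can be appended to an existing feature vector and passed to any standard classifier.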
