Learning from the Ones that Got Away: Detecting New Forms of Phishing Attacks

Phishing attacks continue to pose a major threat to computer system defenders, often forming the first step in a multi-stage attack. Phishing detection has made great strides; however, some phishing emails still pass through filters because attackers make simple structural and semantic changes to the messages. We tackle this problem with a machine learning classifier operating on a large corpus of phishing and legitimate emails. We design SAFe-PC (Semi-Automated Feature generation for Phish Classification), a system that extracts features, elevating some to higher-level features, that are meant to defeat common phishing email detection strategies. To evaluate SAFe-PC, we collect a large corpus of phishing emails from the central IT organization at a tier-1 university. Running SAFe-PC on the dataset reveals previously unknown insights about phishing campaigns directed at university users. SAFe-PC detects more than 70 percent of the emails that had eluded our production deployment of Sophos, a state-of-the-art email filtering tool. It also outperforms SpamAssassin, a commonly used email filtering tool. We also develop an online version of SAFe-PC that can be incrementally retrained with new samples. Its detection performance improves over time as new samples are collected, while the time to retrain the classifier stays constant.
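
The incremental-retraining property described above can be illustrated with a minimal sketch. It is not the SAFe-PC implementation: the class name `OnlinePhishClassifier`, the bag-of-words features, the logistic-regression update, and the toy emails are all assumptions chosen for illustration. The point it shows is that each update touches only the tokens of the new sample, so the cost of absorbing one email does not grow with the number of emails already seen.

```python
import math


class OnlinePhishClassifier:
    """Hypothetical toy model: online logistic regression over word features."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}   # token -> weight, grows as new tokens appear
        self.bias = 0.0

    @staticmethod
    def featurize(text):
        # Assumed feature set: the distinct lowercase tokens of the email body.
        return set(text.lower().split())

    def predict_proba(self, text):
        score = self.bias + sum(self.weights.get(t, 0.0) for t in self.featurize(text))
        score = max(-30.0, min(30.0, score))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, text, label):
        """Absorb one labeled email (1 = phishing, 0 = legitimate).

        The work done is proportional to the number of tokens in this
        email only, independent of how many emails were seen before.
        """
        error = label - self.predict_proba(text)
        step = self.learning_rate * error
        for token in self.featurize(text):
            self.weights[token] = self.weights.get(token, 0.0) + step
        self.bias += step


# Made-up training stream, purely for illustration.
stream = [
    ("verify your account password now", 1),
    ("meeting notes attached for review", 0),
    ("urgent verify your password", 1),
    ("lunch schedule for the seminar", 0),
]

clf = OnlinePhishClassifier()
for _ in range(20):              # replay the tiny stream a few times
    for text, label in stream:
        clf.update(text, label)

phish_score = clf.predict_proba("please verify your password")
legit_score = clf.predict_proba("seminar notes for review")
```

In this sketch a newly reported phishing email is folded in with a single `update` call rather than a full retrain; a production system would use richer features and a stronger learner than this toy model.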
