Boosting the phishing detection performance by semantic analysis

Phishing has grown increasingly severe in recent years and seriously threatens the privacy and property of Internet users. Phishing is essentially brand counterfeiting: to deceive victims effectively, phishing sites are made visually and semantically very similar to the legitimate sites they imitate. Machine learning has become the mainstream approach to anti-phishing in recent years, and the effectiveness of such models hinges on the statistical features they extract. However, existing features mainly capture visual similarity, information-stealing behavior, and the use of third-party services, and they ignore the semantic information of web pages. We therefore extract a series of semantic features with word2vec to better characterize phishing sites, and fuse them with other multi-scale statistical features to build a more robust phishing detection model. Experimental results on real-world data sets show that the majority of phishing websites can be identified effectively from the semantic features of word embeddings alone. The detection models based on fused features achieve the best results, which indicates that semantic features and the other statistical features complement each other well. The proposed method offers a promising way to improve phishing detection performance in real Internet environments.
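The fusion of semantic and statistical features described above can be illustrated with a minimal sketch. The abstract does not specify how word vectors are aggregated into page-level features or which statistical features are used, so everything here is an assumption made for illustration: the toy embedding table `TOY_EMBEDDINGS` stands in for a trained word2vec model, averaging is only one possible aggregation, and the two statistical features are hypothetical placeholders.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embedding table. In practice these vectors would come
# from a word2vec model trained on web page text (an assumption here).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "login": [0.9, 0.1, 0.0],
    "password": [0.8, 0.2, 0.1],
    "bank": [0.7, 0.0, 0.3],
    "weather": [0.0, 0.9, 0.2],
}


def semantic_features(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the embeddings of in-vocabulary tokens (assumed aggregation).

    Returns a zero vector when no token is in the vocabulary.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse(semantic: Sequence[float], statistical: Sequence[float]) -> List[float]:
    """Concatenate semantic and statistical features into one vector."""
    return list(semantic) + list(statistical)


# Tokens from a hypothetical page; out-of-vocabulary words are skipped.
page_tokens = "please login with your bank password".split()

# Hypothetical statistical features: [has_password_form, uses_https].
statistical = [1.0, 0.0]

semantic = semantic_features(page_tokens, TOY_EMBEDDINGS, dim=3)
fused = fuse(semantic, statistical)
```

The fused vector could then be passed to any standard classifier. Concatenation is used here because it is the simplest form of fusion; the original work may combine the features differently.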
