Homoglyph attack detection with unpaired data

The human eye is easily fooled by cyber-attacks designed to deceive visually. One effective attack of this kind is spoofing, also called a homoglyph attack. A homoglyph attack uses a spoofed domain or process (file) name that is hard to distinguish from the legitimate one. A user may therefore be drawn to click on the spoofed process or domain name, which in the worst case triggers malware planted there and puts sensitive personal information at risk of exposure. Existing work addresses this problem with string comparison techniques of the kind widely used to compare genomes. Although effective, these methods are computationally expensive and suffer from low precision because of a high number of false positives. More recently, neural networks have been applied to homoglyph attacks to make preemptive detection more efficient. However, both approaches share a constraint: they require paired sequences to determine the difference between real and spoofed strings. As a result, they are impractical in real-world scenarios where paired sequences are unavailable. In this paper, we introduce a new unpaired homoglyph attack detection system based on a convolutional neural network. We construct two unpaired datasets from the original datasets reported in [36], which contain real and spoofed names for both domains and processes. We train the model end-to-end in a supervised manner. Our experiments demonstrate that the model detects homoglyph attacks robustly. Additionally, it is easy to integrate into any browser through a simple REST [28] API.
We show that our model reaches state-of-the-art performance in detecting homoglyph attacks, with 94% accuracy on the domain spoof dataset and 95% accuracy on the process spoof dataset, without requiring paired data as input. We believe this work is useful in real-world settings for safeguarding users' sensitive information from adversaries.
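
To make the pairing constraint concrete, here is a minimal Python sketch of the string-comparison style of detection. It is an illustration only, not the method of this paper or of the cited works. The function name `edit_distance` and the example strings are assumptions chosen for the demonstration. The spoofed name replaces the two Latin letters "o" with the visually similar Cyrillic letter U+043E, and the comparison works only because the legitimate name is supplied alongside it.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


# Hypothetical example pair: the spoof uses Cyrillic U+043E in place of Latin "o".
real = "google.com"
spoof = "g\u043e\u043egle.com"

print(real == spoof)               # False: the strings differ despite looking alike
print(edit_distance(real, spoof))  # 2: one substitution per replaced character
```

A small distance to a known legitimate name flags the spoof, but the check needs that legitimate name as a second input. An unpaired detector, by contrast, has to classify `spoof` on its own.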

[1]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Sam Ruby,et al.  RESTful Web Services , 2007 .

[4]  Stephen Morris,et al.  Typo-Squatting: The "Curse" of Popularity , 2009 .

[5]  Sepp Hochreiter,et al.  Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[6]  Ian D. Reid,et al.  RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[8]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[9]  Hennie A. Kruger,et al.  Identity Theft - Empirical evidence from a Phishing Exercise , 2007, SEC.

[10]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  John M. Walker,et al.  Comparative Genomics , 2007, Methods In Molecular Biology™.

[12]  K. P. Soman,et al.  Evaluating deep learning approaches to characterize and classify malicious URL's , 2018, J. Intell. Fuzzy Syst..

[13]  Viktor Krammer.  Phishing defense against IDN address spoofing attacks , 2006, PST.

[14]  Lawrence K. Saul,et al.  Identifying suspicious URLs: an application of large-scale online learning , 2009, ICML '09.

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Hyrum S. Anderson,et al.  Detecting Homoglyph Attacks with a Siamese Neural Network , 2018, 2018 IEEE Security and Privacy Workshops (SPW).

[18]  Zhuo Lu,et al.  Cyber security in the Smart Grid: Survey and challenges , 2013, Comput. Networks.

[19]  Naima Kaabouch,et al.  Cyber security in the Smart Grid: Survey and challenges , 2013, Comput. Networks.

[20]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[22]  James B. Fraley,et al.  The promise of machine learning in cybersecurity , 2017, SoutheastCon 2017.

[23]  Lorrie Faith Cranor,et al.  Cantina: a content-based approach to detecting phishing web sites , 2007, WWW '07.

[24]  Daniela Chudá,et al.  Plagiarism Detection in Students' Assignments Written in Natural Language , 2016 .

[25]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Steven C. H. Hoi,et al.  URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection , 2018, ArXiv.

[27]  Niels Provos,et al.  A framework for detection and measurement of phishing attacks , 2007, WORM '07.

[28]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jason Hong,et al.  The state of phishing attacks , 2012, Commun. ACM.

[30]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[31]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[32]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[33]  Cui Yu,et al.  Rapid Homoglyph Prediction and Detection , 2018, 2018 1st International Conference on Data Intelligence and Security (ICDIS).

[34]  Mark Stevenson,et al.  Plagiarism Detection in Texts Obfuscated with Homoglyphs , 2017, ECIR.

[35]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..