Convolutional Neural Network Based Ensemble Approach for Homoglyph Recognition

Homoglyphs are pairs of Unicode characters whose visual representations look similar to the human eye. Identifying homoglyphs is extremely useful for building a strong defence mechanism against phishing and spoofing attacks, ID imitation, profanity abuse, and similar threats. Although the Unicode Consortium publishes a list of discovered homoglyphs, the regular expansion of Unicode character scripts necessitates a robust and reliable algorithm that is capable of identifying all possible new homoglyphs. In this article, we first show that shallow Convolutional Neural Networks are capable of identifying homoglyphs. We propose two variations, both of which obtain very high accuracy (99.44%) on our benchmark dataset. We also report that adopting transfer learning allows another model to achieve 100% recall on our dataset. We ensemble these three methods to obtain 99.72% accuracy on our independent test dataset. These results illustrate the superiority of our ensembled model in detecting homoglyphs and suggest that our model can be used to detect new homoglyphs as more Unicode characters are added. As a by-product, we also prepare a benchmark dataset based on the currently available list of homoglyphs.
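
The abstract does not say how the three models are combined. As a minimal sketch, assuming a simple majority vote over binary predictions (1 = the pair is a homoglyph, 0 = it is not), the ensembling step could look like the following. The function names and the example predictions are hypothetical and serve only to illustrate the idea; they are not the authors' implementation or data.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model binary predictions by majority vote.

    predictions[m][i] is model m's label for sample i
    (1 = homoglyph pair, 0 = not a homoglyph pair).
    Assumption: a plain majority rule; the source does not state
    which combination rule the ensemble actually uses.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    n_models = len(predictions)
    combined = []
    for i in range(n_samples):
        votes = sum(p[i] for p in predictions)
        # Label 1 only when strictly more than half of the models vote 1.
        combined.append(1 if 2 * votes > n_models else 0)
    return combined


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples where the predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions from two shallow CNNs and one transfer-learning model.
cnn_a = [1, 0, 1, 1]
cnn_b = [1, 1, 0, 1]
transfer = [0, 0, 1, 1]
ensemble = majority_vote([cnn_a, cnn_b, transfer])
```

With an odd number of models a majority vote never ties, which is one practical reason to combine exactly three classifiers in this way.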
