MalJPEG: Machine Learning Based Solution for the Detection of Malicious JPEG Images

In recent years, cyber-attacks against individuals, businesses, and organizations have increased. Cyber criminals are always looking for effective vectors to deliver malware to victims in order to launch an attack. Images are used on a daily basis by millions of people around the world, and most users consider them safe; however, some types of images can contain a malicious payload and perform harmful actions. JPEG is the most popular image format, primarily due to its lossy compression. It is used by almost everyone, from individuals to large organizations, and can be found on almost every device and platform (digital cameras, smartphones, websites, social media, etc.). Because of their harmless reputation, massive use, and high potential for misuse, JPEG images are used by cyber criminals as an attack vector. While machine learning methods have been shown to be effective at detecting known and unknown malware in various domains, to the best of our knowledge, they have not been applied specifically to the detection of malicious JPEG images. In this paper, we present MalJPEG, the first machine learning-based solution tailored specifically for the efficient detection of unknown malicious JPEG images. MalJPEG statically extracts 10 simple yet discriminative features from the JPEG file structure and feeds them to a machine learning classifier in order to discriminate between benign and malicious JPEG images. We evaluated MalJPEG extensively on a real-world representative collection of 156,818 images that contains 155,013 (98.85%) benign and 1,805 (1.15%) malicious images. The results show that MalJPEG, when used with the LightGBM classifier, demonstrates the highest detection capabilities, with an area under the receiver operating characteristic curve (AUC) of 0.997, a true positive rate (TPR) of 0.951, and a very low false positive rate (FPR) of 0.004.
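To make the idea of static, structure-based feature extraction concrete, the following is a minimal sketch in Python. The abstract does not list the 10 MalJPEG features, so the features computed here (marker counts, largest segment size, bytes after the end-of-image marker) are illustrative assumptions, and the names `walk_segments` and `extract_features` are hypothetical. The sketch relies only on the standard JPEG marker layout: each segment starts with `0xFF`, a marker byte, and, for most markers, a two-byte big-endian length that includes the length field itself.

```python
import struct
from collections import Counter

# Markers that carry no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def walk_segments(data: bytes):
    """Yield (marker, offset, payload_size) for each marker found in a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    yield (0xD8, 0, 0)
    i, n = 2, len(data)
    while i + 1 < n:
        if data[i] != 0xFF:
            i += 1  # entropy-coded scan data or stray byte
            continue
        marker = data[i + 1]
        if marker == 0xFF:  # fill byte before a marker
            i += 1
            continue
        if marker == 0x00:  # stuffed 0xFF inside scan data
            i += 2
            continue
        if marker in STANDALONE:
            yield (marker, i, 0)
            i += 2
            if marker == 0xD9:  # stop at the first top-level EOI
                return
            continue
        if i + 4 > n:
            return  # truncated file: no room for a length field
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        yield (marker, i, max(length - 2, 0))
        i += 2 + max(length, 2)  # skip marker, length field, and payload


def extract_features(data: bytes) -> dict:
    """Compute illustrative structural features (assumed, not the paper's set)."""
    segments = list(walk_segments(data))
    counts = Counter(marker for marker, _, _ in segments)
    sizes = [size for marker, _, size in segments if marker not in STANDALONE]
    eoi_offsets = [off for marker, off, _ in segments if marker == 0xD9]
    trailing = len(data) - (eoi_offsets[0] + 2) if eoi_offsets else 0
    return {
        "file_size": len(data),
        "num_markers": len(segments),
        "num_app_segments": sum(c for m, c in counts.items() if 0xE0 <= m <= 0xEF),
        "num_comment_segments": counts[0xFE],
        "num_dqt_segments": counts[0xDB],
        "num_dht_segments": counts[0xC4],
        "max_segment_size": max(sizes, default=0),
        "has_eoi": int(bool(eoi_offsets)),
        "bytes_after_eoi": trailing,
    }
```

A feature dictionary of this kind, computed per file, could then be assembled into a table and passed to any classifier, such as the LightGBM model named in the abstract. Because the parser never decodes pixel data, it stays fast and does not execute or render the file's content.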
