Detecting AI Trojans Using Meta Neural Analysis

In machine learning Trojan attacks, an adversary trains a corrupted model that achieves good performance on normal data but behaves maliciously on inputs carrying certain trigger patterns. Several approaches have been proposed to detect such attacks, but they make undesirable assumptions about the attack strategy or require direct access to the trained model, which restricts their practical utility. This paper addresses these challenges by introducing a Meta Neural Trojan Detection (MNTD) pipeline that makes no assumptions about the attack strategy and requires only black-box access to models. The strategy is to train a meta-classifier that predicts whether a given target model is Trojaned. To train the meta-classifier without knowledge of the attack strategy, we introduce a technique called jumbo learning, which samples a set of Trojaned models following a general distribution of attack settings. We then dynamically optimize a query set together with the meta-classifier to distinguish between Trojaned and benign models. We evaluate MNTD on vision, speech, tabular, and natural-language datasets, and against different Trojan attacks such as data poisoning, model manipulation, and latent attacks. MNTD achieves a 97% detection AUC and significantly outperforms existing detection approaches. In addition, MNTD generalizes well, achieving high detection performance against unforeseen attacks. We also propose a robust MNTD pipeline that achieves 90% detection AUC even when the attacker aims to evade detection with full knowledge of the system.
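To make the pipeline concrete, below is a minimal sketch of the two core ideas, jumbo learning and joint optimization of the query set with the meta-classifier, written in PyTorch on toy synthetic data. Everything here (the tiny architectures, helper names such as sample_jumbo_trigger, and all hyperparameters) is an illustrative assumption, not the authors' implementation.

```python
# Minimal MNTD-style sketch on toy data; illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, CLASSES, N_QUERIES = 20, 4, 8

def make_clean_data(n=512):
    # Toy task: the label is the index of the largest of the first CLASSES features.
    x = torch.randn(n, DIM)
    y = x[:, :CLASSES].argmax(dim=1)
    return x, y

def sample_jumbo_trigger():
    # Jumbo learning: sample a random attack setting from a broad distribution
    # (here: a random feature mask, trigger pattern, and target label).
    mask = (torch.rand(DIM) < 0.2).float()
    pattern = torch.randn(DIM)
    target = torch.randint(0, CLASSES, (1,)).item()
    return mask, pattern, target

def train_shadow(trojan, epochs=30):
    # Train one shadow model; if `trojan`, poison 10% of the training set.
    x, y = make_clean_data()
    if trojan:
        mask, pattern, target = sample_jumbo_trigger()
        k = len(x) // 10
        x[:k] = x[:k] * (1 - mask) + pattern * mask
        y[:k] = target
    net = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, CLASSES))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(net(x), y).backward()
        opt.step()
    for p in net.parameters():       # freeze: meta-training only tunes the
        p.requires_grad_(False)      # queries and the meta-classifier head
    return net

# Shadow-model training set: half benign, half jumbo-Trojaned.
shadows = [(train_shadow(trojan=t), float(t)) for t in [False, True] * 16]

# Meta-classifier: a learnable query set plus a linear head over the
# concatenated query outputs; both are optimized jointly.
queries = nn.Parameter(torch.randn(N_QUERIES, DIM))
head = nn.Linear(N_QUERIES * CLASSES, 1)
meta_opt = torch.optim.Adam([queries, *head.parameters()], lr=1e-2)

for _ in range(50):
    for net, label in shadows:
        feat = net(queries).flatten()        # model outputs on the query set
        loss = F.binary_cross_entropy_with_logits(head(feat),
                                                  torch.tensor([label]))
        meta_opt.zero_grad()
        loss.backward()                      # gradients also update the queries
        meta_opt.step()

# Score an unseen target model: higher score => more likely Trojaned.
target_model = train_shadow(trojan=True)
with torch.no_grad():
    print(torch.sigmoid(head(target_model(queries).flatten())).item())
```

Note that gradients flow through the frozen shadow models only to tune the queries during meta-training, which is fine because the defender owns those models; at detection time the target model is exercised purely as a black box, by submitting the learned queries and reading back its outputs.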
