Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation

Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate this brittleness, but such methods are unlikely to find worst-case adversaries because of the complexity of the search space induced by discrete text perturbations. In this work, we approach the problem from the opposite direction: we formally verify a system's robustness against a predefined class of adversarial attacks. We study text classification under synonym-replacement and character-flip perturbations. We propose modeling these input perturbations as a simplex and then applying Interval Bound Propagation (IBP), a formal model-verification method. We modify the conventional log-likelihood training objective to produce models that can be verified efficiently, whereas exact verification would otherwise require a search of exponential complexity over the perturbation set. The resulting models show little loss in nominal accuracy, yet achieve substantially improved verified accuracy under perturbations and come with an efficiently computable formal guarantee against worst-case adversaries.
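
To make the verification step concrete, here is a minimal NumPy sketch of interval bound propagation; it is not the authors' implementation. The embeddings reachable by symbol substitutions (the vertices of the perturbation simplex) define an initial elementwise interval, which is then pushed through an affine layer and a monotone ReLU using the standard IBP rules. The dimensions, random weights, and function names (`interval_affine`, `interval_relu`) are illustrative assumptions.

```python
import numpy as np

def interval_affine(l, u, W, b):
    """Propagate interval bounds [l, u] through an affine layer x @ W + b.

    Standard IBP rule: the interval center passes through W, and the
    interval radius passes through |W| (elementwise absolute value).
    """
    center = (u + l) / 2.0
    radius = (u - l) / 2.0
    new_center = center @ W + b
    new_radius = radius @ np.abs(W)
    return new_center - new_radius, new_center + new_radius

def interval_relu(l, u):
    """ReLU is monotone, so it maps the bounds through directly."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)

# Hypothetical toy setup: a nominal input embedding plus the embeddings
# produced by each allowed symbol substitution (simplex vertices).
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

nominal = rng.normal(size=d_in)             # unperturbed input embedding
substitutions = rng.normal(size=(5, d_in))  # 5 perturbed variants
vertices = np.vstack([nominal, substitutions])

# Initial interval: elementwise min/max over the simplex vertices
# (the convex hull's coordinate-wise extremes occur at vertices).
l0, u0 = vertices.min(axis=0), vertices.max(axis=0)

l1, u1 = interval_relu(*interval_affine(l0, u0, W, b))
print("verified output bounds:", l1, u1)
```

Propagated to the logit layer, these bounds yield the verification check: if the upper bound of every incorrect class logit falls below the lower bound of the true class logit, the prediction is certified robust for the entire perturbation set, without enumerating the exponentially many substitution combinations.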
