Principled approaches to robust machine learning and beyond

As machine learning is applied to ever more important tasks, it becomes increasingly important that these algorithms are robust to systematic, or worse, malicious noise. Despite considerable interest and over sixty years of research, no efficient algorithms robust to such noise in high-dimensional settings were known for some of the most fundamental statistical tasks. In this thesis we devise two novel, but similarly inspired, algorithmic paradigms for estimation in high dimensions in the presence of a small number of adversarially added data points. Both are the first efficient algorithms to achieve (nearly) optimal error bounds for a number of fundamental statistical tasks, such as mean estimation and covariance estimation. The goal of this thesis is to present these two frameworks in a clean and unified manner. We show that these insights also apply to other problems in learning theory. Specifically, we show that these algorithms can be combined with the powerful Sum-of-Squares hierarchy to yield improvements for clustering high-dimensional Gaussian mixture models, the first such improvement in over fifteen years of research. Going full circle, we show that Sum-of-Squares can also be used to improve error rates for robust mean estimation. These algorithms are not only of theoretical interest: we demonstrate empirically that these insights can be used in practice to uncover patterns in high-dimensional data that were previously masked by noise. Based on our algorithms, we give new implementations for robust PCA, new defenses against data poisoning attacks on stochastic optimization, and new defenses against watermarking attacks on deep networks. In all of these tasks, we demonstrate on both synthetic and real data sets that our performance is substantially better than the state of the art, often detecting most or all of the corruptions when previous methods could not reliably detect any.

Thesis Supervisor: Ankur Moitra
Title: Rockwell International Career Development Associate Professor of Mathematics
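To make the filtering idea behind such robust estimators concrete, here is a minimal sketch of spectral filtering for robust mean estimation, in the spirit of the algorithms the abstract describes: outliers that shift the empirical mean by much must also inflate the top eigenvalue of the empirical covariance, so one repeatedly removes points with extreme projections onto the top eigenvector until no direction has abnormally large variance. The function name, spectral threshold, and quantile-based removal rule below are illustrative assumptions, not the tuned procedure from the thesis.

```python
import numpy as np

def filter_mean(X, eps, spectral_threshold=10.0):
    """Estimate the mean of X (an n-by-d array), a (1 - eps)-fraction of
    which comes from a distribution with bounded covariance, by repeatedly
    removing points with outlying projections onto the top eigenvector of
    the empirical covariance. Thresholds here are illustrative."""
    X = np.asarray(X, dtype=float)
    while len(X) > 1:
        mu = X.mean(axis=0)
        centered = X - mu
        cov = centered.T @ centered / len(X)
        # eigh returns eigenvalues in ascending order; take the top pair.
        eigvals, eigvecs = np.linalg.eigh(cov)
        top_val, top_vec = eigvals[-1], eigvecs[:, -1]
        # If no direction has abnormally large variance, the remaining
        # outliers can no longer shift the empirical mean by much: stop.
        if top_val <= spectral_threshold:
            return mu
        # Otherwise, scores along the top direction separate outliers from
        # inliers; remove the most extreme eps-fraction and iterate.
        scores = np.abs(centered @ top_vec)
        keep = scores <= np.quantile(scores, 1.0 - eps)
        if keep.all():  # nothing left to remove; avoid looping forever
            return mu
        X = X[keep]
    return X.mean(axis=0)
```

For instance, if 5% of the points in X are replaced by a distant planted cluster, filter_mean(X, 0.05) will typically recover the true mean far more accurately than the naive estimate X.mean(axis=0), which the planted points can drag arbitrarily far.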
