A Unified View of Local Learning: Theory and Algorithms for Enhancing Linear Models (Une Vue Unifiée de l'Apprentissage Local : Théorie et Algorithmes pour l'Amélioration de Modèles Linéaires)

In machine learning, data characteristics usually vary over the space: the overall distribution may be multi-modal and contain non-linearities. To achieve good performance, a learning algorithm should therefore be able to capture and adapt to these changes. Even though linear models fail to describe complex distributions, they are renowned for their scalability, at both training and test time, to datasets that are large in the number of examples as well as in the number of features. Several methods have been proposed to take advantage of the scalability and simplicity of linear hypotheses in order to build models with great discriminatory capabilities. These methods empower linear models, in the sense that they enhance their expressive power through different techniques.

This dissertation focuses on enhancing local learning approaches, a family of techniques that infers models by capturing the local characteristics of the space in which the observations are embedded. The founding assumption of these techniques is that the learned model should behave consistently on examples that are close, implying that its predictions should also change smoothly over the space. Locality can be defined on spatial criteria (e.g., closeness according to a selected metric) or on other provided relations, such as the association with the same category of examples or a shared attribute. Local learning approaches are known to be effective in capturing complex data distributions without resorting to a model selected specifically for the task. However, state-of-the-art techniques suffer from three major drawbacks: they easily memorize the training set, resulting in poor performance on unseen data; their predictions lack smoothness in particular regions of the space; and they scale poorly with the size of the dataset.

The contributions of this dissertation address the aforementioned pitfalls in two directions: we introduce side information into the problem formulation to enforce smoothness in prediction and attenuate the memorization phenomenon, and we provide a new representation of the dataset that accounts for its local specificities and improves scalability. Thorough studies are conducted to highlight the effectiveness of these contributions and confirm the soundness of their underlying intuitions. We empirically study the performance of the proposed methods on both toy and real tasks, in terms of accuracy and execution time, and compare it to state-of-the-art results. We also analyze our approaches from a theoretical standpoint, by studying their computational and memory complexities and by deriving tight generalization bounds.
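To make the local learning idea concrete, below is a minimal Python sketch in the spirit of cluster-based local linear models: it partitions the input space with k-means and trains one linear SVM per region, so that predictions adapt to local characteristics while each individual model remains linear and cheap. This is only an illustration of the general family of approaches, not an implementation of the dissertation's methods; the function names and the scikit-learn-based setup are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC


def fit_local_linear_models(X, y, n_regions=4, seed=0):
    """Partition the space with k-means, then train one linear SVM per region."""
    partition = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit(X)
    models = {}
    for region in range(n_regions):
        mask = partition.labels_ == region
        # Fall back to a constant predictor if a region contains a single class.
        clf = (LinearSVC() if len(np.unique(y[mask])) > 1
               else DummyClassifier(strategy="most_frequent"))
        models[region] = clf.fit(X[mask], y[mask])
    return partition, models


def predict_local(partition, models, X):
    """Route each test point to the linear model of its nearest centroid."""
    regions = partition.predict(X)
    preds = np.empty(len(X), dtype=int)
    for region, model in models.items():
        mask = regions == region
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds


# Toy non-linear task: two interleaved half-moons, which no single
# global linear model can separate, but local linear models can.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
partition, models = fit_local_linear_models(X, y)
print(f"training accuracy: {(predict_local(partition, models, X) == y).mean():.3f}")
```

Under such a partition-based scheme, prediction remains cheap because each test point is routed to a single linear model, but predictions can change abruptly at region boundaries; this is precisely the kind of non-smoothness, together with the memorization and scalability issues, that the dissertation's contributions aim to mitigate.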
