Memo No. 90, July 18, 2019. Theory III: Dynamics and Generalization in Deep Networks
T. Poggio, L. Rosasco, B. Miranda, Q. Liao, J. Hidary, Andrzej Banburski, Kenji Kawaguchi, Sasha Rakhlin