Akhilesh Deepak Gotmare | Nitish Shirish Keskar | Caiming Xiong | Richard Socher