Jared Kaplan | Sam McCandlish | Tom Henighan | Tom B. Brown | Benjamin Chess | Rewon Child | Scott Gray | Alec Radford | Jeffrey Wu | Dario Amodei
[1] W. Ebeling, et al. Entropy and Long-Range Correlations in Literary English, 1993, cond-mat/0204108.
[2] Joshua Goodman, et al. A bit of progress in language modeling, 2001, Comput. Speech Lang.
[3] Michele Banko, et al. Scaling to Very Very Large Corpora for Natural Language Disambiguation, 2001, ACL.
[4] Stergios B. Fotopoulos, et al. All of Nonparametric Statistics, 2007, Technometrics.
[5] Gérard Biau, et al. Analysis of a Random Forests Model, 2010, J. Mach. Learn. Res.
[6] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[7] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).
[8] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[9] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[10] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[11] Serge J. Belongie, et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks, 2016, NIPS.
[12] Max Tegmark, et al. Criticality in Formal Languages and Statistical Physics, 2016.
[13] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[14] Martial Hebert, et al. Growing a Brain: Fine-Tuning by Increasing Model Capacity, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Diederik P. Kingma, et al. GPU Kernels for Block-Sparse Weights, 2017.
[16] Yang Yang, et al. Deep Learning Scaling is Predictable, Empirically, 2017, ArXiv.
[17] Stefan Thurner, et al. Introduction to the Theory of Complex Systems, 2018, Oxford Scholarship Online.
[18] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[19] Vardan Papyan, et al. The Full Spectrum of Deep Net Hessians At Scale: Dynamics with Sample Size, 2018, ArXiv.
[20] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.
[21] Ethan Dyer, et al. Gradient Descent Happens in a Tiny Subspace, 2018, ArXiv.
[22] Lukasz Kaiser, et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.
[23] Mikhail Belkin, et al. Reconciling modern machine learning and the bias-variance trade-off, 2018, ArXiv.
[24] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[25] Dario Amodei, et al. An Empirical Model of Large-Batch Training, 2018, ArXiv.
[26] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper), 2018, NeurIPS.
[27] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[28] Newsha Ardalani, et al. Beyond human-level accuracy: computational challenges in deep learning, 2019, PPoPP.
[29] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.
[30] Shankar Krishnan, et al. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density, 2019, ICML.
[31] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.
[32] Aran Komatsuzaki, et al. One Epoch Is All You Need, 2019, ArXiv.
[33] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[34] Jaehoon Lee, et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.
[35] Lukasz Kaiser, et al. Universal Transformers, 2018, ICLR.
[36] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, J. Mach. Learn. Res.
[37] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[38] Guodong Zhang, et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model, 2019, NeurIPS.
[39] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[40] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[41] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[42] Levent Sagun, et al. Scaling description of generalization with number of parameters in deep learning, 2019, Journal of Statistical Mechanics: Theory and Experiment.
[43] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[44] Andrew M. Saxe, et al. High-dimensional dynamics of generalization error in neural networks, 2017, Neural Networks.
[45] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[46] Jonathan S. Rosenfeld, et al. A Constructive Prediction of the Generalization Error Across Scales, 2019, ICLR.
[47] Feng Yan, et al. AutoGrow: Automatic Layer Growing in Deep Convolutional Networks, 2019, KDD.