Data and Parameter Scaling Laws for Neural Machine Translation

We observe that the development cross-entropy loss of supervised neural machine translation models scales like a power law with the amount of training data and the number of non-embedding parameters in the model. We discuss some practical implications of these results, such as predicting the BLEU achieved by large-scale models and predicting the ROI of labeling data in low-resource language pairs.
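The power-law claim above can be illustrated with a minimal sketch: if the loss follows L(D) = C · D^(−α), then log L is linear in log D, so the exponent can be recovered with an ordinary least-squares fit in log-log space. The constants C and α below are illustrative values, not figures from the paper.

```python
import numpy as np

# Synthetic data following an assumed power law L(D) = C * D**(-alpha).
# C and alpha here are made-up illustrative constants.
C, alpha = 20.0, 0.3
D = np.array([1e5, 1e6, 1e7, 1e8])   # training-set sizes (e.g. sentence pairs)
L = C * D ** (-alpha)                # development cross-entropy loss

# A power law is linear in log-log space: log L = log C - alpha * log D,
# so a degree-1 polynomial fit recovers the exponent and the constant.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha_hat, C_hat = -slope, np.exp(intercept)
print(alpha_hat, C_hat)
```

The same log-log regression applies to the parameter-count axis; with real measurements the fit would be approximate rather than exact.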