Benchmarking Long-tail Generalization with Likelihood Splits