Masked Autoencoders As Spatiotemporal Learners