M33D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding