MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling