Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework