An Empirical Study of Multimodal Model Merging