Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization