In traditional remote sensing image captioning models, the attention mechanism plays a dominant role and has been used to integrate image features to infer the latent visual-semantic alignment. However, because remote sensing scenes are complex and diverse, capturing features with a single attention module often leads to insufficient semantic representation. In this work, we present a novel Multi-view Attention Network (MAN) that integrates features from different views. With MAN, semantically richer ensemble attended features can be obtained from multiple attention modules. Specifically, we enforce the weights of the attention modules to be diverse through a cosine distance loss, which provides the model with distinct views for making semantic predictions from each feature. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed model for remote sensing image captioning.
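A minimal sketch of the multi-view attention idea follows, assuming PyTorch, additive (Bahdanau-style) attention over N image-region features, and a simple mean fusion of the views; the module names, tensor shapes, and the exact form of the diversity term are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """One attention 'view' over N image-region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, N, feat_dim), hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.w_feat(feats)
                                  + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)         # (B, N) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # (B, feat_dim)
        return context, alpha


class MultiViewAttention(nn.Module):
    """K parallel attention modules whose attention maps are kept diverse."""
    def __init__(self, feat_dim, hidden_dim, attn_dim, num_views=3):
        super().__init__()
        self.views = nn.ModuleList(
            [AdditiveAttention(feat_dim, hidden_dim, attn_dim)
             for _ in range(num_views)])

    def forward(self, feats, hidden):
        contexts, alphas = zip(*(v(feats, hidden) for v in self.views))
        # Ensemble attended feature: here, a simple mean over views.
        fused = torch.stack(contexts, dim=0).mean(0)
        return fused, list(alphas)


def diversity_loss(alphas):
    """Penalize pairwise cosine similarity between the views' attention
    weights, i.e. encourage a large cosine distance between views."""
    loss, pairs = 0.0, 0
    for i in range(len(alphas)):
        for j in range(i + 1, len(alphas)):
            loss = loss + F.cosine_similarity(alphas[i], alphas[j], dim=1).mean()
            pairs += 1
    return loss / max(pairs, 1)
```

In training, this diversity term would typically be added to the standard cross-entropy captioning loss with a weighting coefficient, e.g. L = L_ce + λ·L_div, where λ is an assumed hyperparameter; because softmax attention weights are non-negative, minimizing their pairwise cosine similarity pushes the views toward attending to different image regions.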