VQA is a task to answer natural language questions tied to the content of visual images. Most VQA approaches apply attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model the simple relations between objects, which results in many complicated questions cannot be answered correctly, because of failing to provide sufficient knowledge. Second, they seldom leverage the harmony cooperation of visual appearance feature and relation feature. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. In specific, we devise 1) a self-guided word relation attention scheme, which explore the latent semantic relations between words; 2) two questionadaptive visual relation attention modules that can extract not only the fine-grained and precise binary relations between objects but also the more sophisticated trinary relations. Both question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposal combines appearance feature with relations to reconcile the two types of features.