For 360° video, the existing visual quality assessment (VQA) approaches are designed based on either the whole frames or the cropped patches, ignoring the fact that subjects can only access viewports. When watching 360° video, subjects select viewports through head movement (HM) and then fixate on attractive regions within the viewports through eye movement (EM). Therefore, this paper proposes a two-staged multi-task approach for viewport-based VQA on 360° video. Specifically, we first establish a large-scale VQA dataset of 360° video, called VQA-ODV, which collects the subjective quality scores and the HM and EM data on 600 video sequences. By mining our dataset, we find that the subjective quality of 360° video is related to camera motion, viewport positions and saliency within viewports. Accordingly, we propose a viewport-based convolutional neural network (V-CNN) approach for VQA on 360° video, which has a novel multi-task architecture composed of a viewport proposal network (VP-net) and viewport quality network (VQ-net). The VP-net handles the auxiliary tasks of camera motion detection and viewport proposal, while the VQ-net accomplishes the auxiliary task of viewport saliency prediction and the main task of VQA. The experiments validate that our V-CNN approach significantly advances state-of-the-art VQA performance on 360° video and it is also effective in the three auxiliary tasks.