Word-to-region attention network for visual question answering