The multi-modal fusion in visual question answering: a review of attention mechanisms