Question Aware Vision Transformer for Multimodal Reasoning