Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA