On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization