Explaining transformer-based image captioning models: An empirical analysis