Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?