Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer