A TRAINABLE VISUALLY-GROUNDED SPOKEN LANGUAGE GENERATION SYSTEM