Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model