Self-Contained Entity Discovery from Captioned Videos