Towards Knowledge-Aware Video Captioning via Transitive Visual Relationship Detection