Localizing Moments in Long Video Via Multimodal Guidance