Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment