Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization