Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding