Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding