Language-Aware Spatial-Temporal Collaboration for Referring Video Segmentation