Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective