RTQ: Rethinking Video-language Understanding Based on Image-text Model