Visual and language semantic hybrid enhancement and complementary for video description