Level-wise aligned dual networks for text–video retrieval