With the growth of video-sharing platforms and social media applications, video retrieval plays an important role in many applications, such as copyright infringement detection, event classification, and personalized recommendation. Content-based video retrieval presents two main challenges: (i) the distribution of feature representations is inconsistent between the source domain and the target domain; (ii) aggregating a video while sufficiently incorporating frame-level information is difficult. In this paper, we propose an unsupervised teacher-student model (UTS Net) to improve the performance of content-based video retrieval. It has two components: (i) a teacher-student model that maintains the global consistency of feature representations across domains while retaining the local inconsistency among intra-batch data; (ii) a simple but effective video retrieval pipeline that integrates frame-level binarized features. Experimentally, our proposed framework outperforms state-of-the-art approaches on the DSVR, CSVR, and ISVR tasks of the FIVR datasets, achieving a mean average precision of 76%, 72%, and 61%, respectively.
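For readers unfamiliar with the metric, the mean average precision (mAP) quoted above can be illustrated with a minimal sketch. This is the standard definition of average precision over a ranked result list, not code from this work; the function names and the toy relevance lists are hypothetical, and the sketch assumes every relevant video appears somewhere in the ranked list (as when the whole database is ranked for each query).

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[k] is True when the item at rank k+1 is relevant.
    Assumes all relevant items are present in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    # A query with no relevant items contributes 0 by convention here.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical toy data: two queries with ranked relevance flags.
toy_rankings = [
    [True, False, True],  # relevant items at ranks 1 and 3
    [False, True],        # relevant item at rank 2
]
toy_map = mean_average_precision(toy_rankings)
```

Under this definition, a score of 76% means that, averaged over queries, relevant videos tend to sit near the top of each ranked list, with few irrelevant videos ranked ahead of them.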