Clover: Towards A Unified Video-Language Alignment and Fusion Model