Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics