LDSLVISION SUBMISSIONS TO DCASE’21: A MULTI-MODAL FUSION APPROACH FOR AUDIO-VISUAL SCENE CLASSIFICATION ENHANCED BY CLIP VARIANTS

Technical Report