Motion-aware Contrastive Video Representation Learning via Foreground-background Merging