Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition