Extracting High-level Multimodal Features