MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French

Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind, with 40,000 total labelled sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels, including sentiment (and subjectivity), emotions, and attributes. Our evaluations with a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further multilingual research in multimodal language.
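
To make the annotation scheme concrete, the following is a minimal sketch (in Python) of how one labelled sentence with its multi-label supervision might be represented. The class and field names, value ranges, and language codes are illustrative assumptions for this sketch only, not the dataset's released format or API.

    from dataclasses import dataclass, field
    from typing import Dict

    # Hypothetical record layout for one labelled sentence in a
    # CMU-MOSEAS-style corpus; field names and ranges are assumptions.
    @dataclass
    class LabelledSentence:
        language: str                 # e.g. "es", "pt", "de", or "fr"
        transcript: str               # the spoken sentence
        sentiment: float              # e.g. a continuous score such as [-3, 3]
        subjectivity: float           # degree of subjectivity, e.g. [0, 1]
        emotions: Dict[str, float] = field(default_factory=dict)    # per-emotion scores
        attributes: Dict[str, float] = field(default_factory=dict)  # speaker/content attributes

    # Usage: one annotated Spanish sentence with a positive sentiment label.
    example = LabelledSentence(
        language="es",
        transcript="Me encantó esta película.",  # "I loved this movie."
        sentiment=2.5,
        subjectivity=0.9,
        emotions={"happiness": 0.8},
    )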
