MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French

Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind, with 40,000 total labelled sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels, including sentiment (and subjectivity), emotions, and attributes. Our evaluations with a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further multilingual research in multimodal language.
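
To make the annotation scheme concrete, the following is a minimal sketch (in Python) of how one labelled sentence with its multi-label supervision might be represented. The class and field names, value ranges, and language codes are illustrative assumptions for this sketch only, not the dataset's released format or API.

    from dataclasses import dataclass, field
    from typing import Dict

    # Hypothetical record layout for one labelled sentence in a
    # CMU-MOSEAS-style corpus; field names and ranges are assumptions.
    @dataclass
    class LabelledSentence:
        language: str                 # e.g. "es", "pt", "de", or "fr"
        transcript: str               # the spoken sentence
        sentiment: float              # e.g. a continuous score such as [-3, 3]
        subjectivity: float           # degree of subjectivity, e.g. [0, 1]
        emotions: Dict[str, float] = field(default_factory=dict)    # per-emotion scores
        attributes: Dict[str, float] = field(default_factory=dict)  # speaker/content attributes

    # Usage: one annotated Spanish sentence with a positive sentiment label.
    example = LabelledSentence(
        language="es",
        transcript="Me encantó esta película.",  # "I loved this movie."
        sentiment=2.5,
        subjectivity=0.9,
        emotions={"happiness": 0.8},
    )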
