MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has had limited resources for studying (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. To accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MULTIBENCH, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MULTIBENCH provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MULTIBENCH offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MULTIBENCH introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning spanning innovations in fusion paradigms, optimization objectives, and training approaches. Simply applying methods proposed in different research areas can improve state-of-the-art performance on 9 of the 15 datasets. MULTIBENCH therefore presents a milestone in unifying disjoint efforts in multimodal machine learning research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MULTIBENCH, our standardized implementations, and leaderboards are publicly available, will be regularly updated, and welcome input from the community.
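To make the three evaluation axes concrete, the following is a minimal, self-contained sketch of the kind of standardized pipeline the abstract describes: train a simple two-modality late-fusion model, then report (1) performance, (2) parameter count and inference time, and (3) robustness as accuracy under increasing noise on one modality. It is illustrative only; the synthetic data, model, and all names are assumptions, not the actual MULTIBENCH API.

```python
# Illustrative sketch (NOT the MultiBench API): a tiny late-fusion model
# evaluated along the benchmark's three axes.
import time
import torch
from torch import nn


class LateFusionClassifier(nn.Module):
    """Encode each modality separately, concatenate, then classify."""

    def __init__(self, dim_a=20, dim_b=35, hidden=64, n_classes=2):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x_a, x_b):
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))


def accuracy(model, x_a, x_b, y):
    with torch.no_grad():
        return (model(x_a, x_b).argmax(-1) == y).float().mean().item()


# Synthetic stand-in for a two-modality dataset (hypothetical).
torch.manual_seed(0)
x_a, x_b = torch.randn(512, 20), torch.randn(512, 35)
y = (x_a[:, 0] + x_b[:, 0] > 0).long()

# (1) Generalization: standard supervised training and accuracy.
model = LateFusionClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x_a, x_b), y).backward()
    opt.step()
print(f"clean accuracy: {accuracy(model, x_a, x_b, y):.3f}")

# (2) Complexity: parameter count and wall-clock inference time.
n_params = sum(p.numel() for p in model.parameters())
start = time.time()
accuracy(model, x_a, x_b, y)
print(f"params={n_params}, inference_time={time.time() - start:.4f}s")

# (3) Robustness: accuracy as one modality is progressively corrupted.
for noise in [0.0, 0.5, 1.0, 2.0]:
    acc = accuracy(model, x_a + noise * torch.randn_like(x_a), x_b, y)
    print(f"noise std {noise:.1f}: accuracy {acc:.3f}")
```

In a benchmark setting, the same evaluation protocol would be applied uniformly across datasets and fusion methods, which is what allows the generalization, complexity, and robustness numbers to be compared directly.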
