MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. To accelerate progress on understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit of standardized implementations of more than 20 core multimodal algorithms, and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way toward a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Both toolkits are publicly available, will be regularly updated, and welcome contributions from the community.
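
As a concrete illustration of the end-to-end pipeline, the sketch below shows how a simple late-fusion model might be assembled and trained: per-modality encoders, a fusion module, a prediction head, and a shared supervised training/testing structure. The module paths, class names, and call signatures (get_dataloader, LeNet, MLP, Concat, train, test) follow the repository's documented layout but are illustrative assumptions; check them against the current MultiBench codebase before use.

```python
# Illustrative sketch of the MultiZoo/MultiBench pipeline on an
# audio-image digit-classification dataset. Module paths and signatures
# are assumptions based on the repository layout and may differ in detail.
import torch

from datasets.avmnist.get_data import get_dataloader              # standardized data loading
from unimodals.common_models import LeNet, MLP                    # unimodal encoders / head
from fusions.common_fusions import Concat                         # multimodal fusion module
from training_structures.Supervised_Learning import train, test   # shared training/evaluation loops

# 1. Load standardized train/validation/test splits.
traindata, validdata, testdata = get_dataloader('/path/to/avmnist')

# 2. Define one encoder per modality (image and audio spectrogram).
channels = 6
encoders = [LeNet(1, channels, 3), LeNet(1, channels, 5)]

# 3. Choose how modality features are combined, and the prediction head.
fusion = Concat()
head = MLP(channels * 40, 100, 10)

# 4. Train with the shared supervised-learning structure; the best
#    checkpoint is assumed to be saved to disk by the training loop.
train(encoders, fusion, head, traindata, validdata, epochs=30)

# 5. Evaluate generalization on the held-out test set.
model = torch.load('best.pt')
test(model, testdata)
```

Swapping in a different dataset, encoder, or fusion method then amounts to changing the corresponding module while reusing the same training structure, which is what makes standardized comparison of generalization, complexity, and robustness practical.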
