MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning spanning innovations in fusion paradigms, optimization objectives, and training approaches. Simply applying methods proposed in different research areas improves state-of-the-art performance on 9/15 datasets. MultiBench therefore represents a milestone in unifying disjoint efforts in multimodal machine learning research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized implementations, and leaderboards are publicly available, will be regularly updated, and welcome input from the community.
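To make the evaluation methodology concrete, the sketch below illustrates the kind of workflow the MultiBench pipeline standardizes: load paired multimodal data, train a simple late-fusion baseline, and report (1) prediction accuracy, (2) inference time, and (3) robustness to a noisy modality. This is a minimal, self-contained illustration in plain PyTorch with synthetic data; the dataset, model, and evaluation names are illustrative stand-ins and are not the actual MultiBench API.

```python
# Minimal sketch of a MultiBench-style evaluation loop (illustrative only,
# not the MultiBench API): accuracy, inference time, and modality robustness.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a two-modality dataset (e.g., audio + text features).
n, d_audio, d_text, n_classes = 512, 40, 300, 2
x_audio = torch.randn(n, d_audio)
x_text = torch.randn(n, d_text)
y = torch.randint(0, n_classes, (n,))
loader = DataLoader(TensorDataset(x_audio, x_text, y), batch_size=32, shuffle=True)

class LateFusion(nn.Module):
    """Simple late-fusion baseline: encode each modality, concatenate, classify."""
    def __init__(self):
        super().__init__()
        self.enc_audio = nn.Sequential(nn.Linear(d_audio, 32), nn.ReLU())
        self.enc_text = nn.Sequential(nn.Linear(d_text, 32), nn.ReLU())
        self.head = nn.Linear(64, n_classes)

    def forward(self, a, t):
        return self.head(torch.cat([self.enc_audio(a), self.enc_text(t)], dim=-1))

model = LateFusion()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Standard supervised training loop.
for epoch in range(5):
    for a, t, target in loader:
        opt.zero_grad()
        loss = loss_fn(model(a, t), target)
        loss.backward()
        opt.step()

def accuracy(noise_std=0.0):
    # Robustness is probed by perturbing one modality with Gaussian noise.
    with torch.no_grad():
        noisy_audio = x_audio + noise_std * torch.randn_like(x_audio)
        preds = model(noisy_audio, x_text).argmax(-1)
    return (preds == y).float().mean().item()

# Report the three evaluation axes: generalization (accuracy),
# complexity (wall-clock inference time), and robustness (noisy accuracy).
start = time.time()
clean_acc = accuracy(0.0)
elapsed = time.time() - start
print(f"accuracy={clean_acc:.3f}  inference_time={elapsed:.3f}s  "
      f"noisy_accuracy={accuracy(1.0):.3f}")
```

In practice, the same loop would be repeated across datasets, fusion paradigms, and noise levels to populate the leaderboards described above.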
