Deep Multimodal Representation Learning from Temporal Data

In recent years, deep learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time-series data such as video, audio, and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and of the temporal dependencies between modalities, (ii) the use of multiple loss terms in the objective function, including a maximum-correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model through experiments on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of the different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness, and state-of-the-art performance on multiple datasets.
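
The abstract does not include an implementation, but the maximum-correlation loss term it mentions can be illustrated with a short sketch. The snippet below is a minimal PyTorch example of one plausible form of such a loss, maximizing the per-dimension correlation between paired embeddings of two modalities; the function name, tensor shapes, and exact normalization are our assumptions for illustration, not the authors' published formulation.

```python
import torch

def correlation_loss(h_x: torch.Tensor, h_y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative mean correlation between paired embeddings of two modalities.

    h_x, h_y: (batch, dim) hidden representations of the two modalities
    for the same samples/time steps (hypothetical shapes). Minimizing the
    returned scalar maximizes cross-modal correlation.
    """
    # Center each representation across the batch.
    h_x = h_x - h_x.mean(dim=0, keepdim=True)
    h_y = h_y - h_y.mean(dim=0, keepdim=True)
    # Per-dimension covariance and normalization terms.
    cov = (h_x * h_y).sum(dim=0)
    denom = torch.sqrt((h_x ** 2).sum(dim=0) * (h_y ** 2).sum(dim=0)) + eps
    # Average correlation over embedding dimensions, negated for minimization.
    return -(cov / denom).mean()

# Example usage (illustrative shapes only):
h_audio = torch.randn(32, 128)  # e.g., recurrent states for the audio stream
h_video = torch.randn(32, 128)  # e.g., recurrent states for the video stream
loss = correlation_loss(h_audio, h_video)
```

In a full model, a term of this kind would be combined with the other loss terms of the multi-term objective the abstract describes (e.g., reconstruction losses), with relative weights chosen as hyperparameters.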
