Predicting meeting extracts in group discussions using multimodal convolutional neural networks

This study proposes multimodal fusion models based on Convolutional Neural Networks (CNNs) for extracting meeting minutes from a group discussion corpus. First, unimodal models are trained on raw behavioral data such as speech, head motion, and face tracking; these models are then integrated into a fusion model that serves as the classifier. The main advantage of this approach is that the proposed models are trained without any hand-crafted features, yet they outperform a baseline model trained on hand-crafted features. The results also suggest that multimodal fusion is useful when applying the CNN approach to modeling multimodal, multiparty interaction.
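The architecture described above — per-modality CNN feature extractors over raw signals, concatenated into a feature-level fusion classifier — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the layer sizes, kernel shapes, and the three synthetic input streams (`speech`, `head`, `face`) are assumptions chosen only to show the data flow.

```python
import numpy as np

def conv1d_relu(x, kernels, stride=1):
    """Valid 1-D convolution with ReLU.
    x: (T, C_in) raw signal; kernels: (K, C_in, C_out)."""
    K, C_in, C_out = kernels.shape
    T_out = (len(x) - K) // stride + 1
    out = np.zeros((T_out, C_out))
    for t in range(T_out):
        window = x[t * stride : t * stride + K]          # (K, C_in)
        out[t] = np.einsum("kc,kco->o", window, kernels)  # one output frame
    return np.maximum(out, 0.0)

def unimodal_features(x, kernels):
    """One conv layer + global max pooling -> fixed-length feature vector."""
    return conv1d_relu(x, kernels).max(axis=0)

def fuse_and_classify(feature_vectors, W, b):
    """Feature-level fusion: concatenate unimodal features, linear + softmax."""
    z = np.concatenate(feature_vectors) @ W + b
    e = np.exp(z - z.max())                               # numerically stable softmax
    return e / e.sum()

# Synthetic stand-ins for the raw behavioral streams (shapes are assumptions).
rng = np.random.default_rng(0)
speech = rng.normal(size=(100, 1))   # e.g. raw speech feature frames
head   = rng.normal(size=(50, 3))    # e.g. head rotation (x, y, z)
face   = rng.normal(size=(50, 2))    # e.g. face-tracking coordinates

feats = [
    unimodal_features(speech, rng.normal(size=(5, 1, 8))),
    unimodal_features(head,   rng.normal(size=(5, 3, 8))),
    unimodal_features(face,   rng.normal(size=(5, 2, 8))),
]
W = rng.normal(size=(24, 2))  # 3 modalities x 8 features -> 2 classes
b = np.zeros(2)
probs = fuse_and_classify(feats, W, b)  # P(extract) vs. P(non-extract)
```

In a trained system the kernels and classifier weights would be learned end-to-end; here random weights merely demonstrate that each modality is reduced to a fixed-length vector before fusion, so streams of different lengths and channel counts can be combined.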
