Multimodal Attention Network for Trauma Activity Recognition from Spoken Language and Environmental Sound

Trauma activity recognition aims to detect, recognize, and predict the activities (or tasks) performed during trauma resuscitation. Previous work has mainly focused on using various sensor data, including images, RFID, and vital signs, to generate the trauma event log. However, spoken language and environmental sound, which carry rich communication and contextual information necessary for trauma team cooperation, have been largely ignored. In this paper, we propose a multimodal attention network (MAN) that takes both verbal transcripts and the environmental audio stream as input; the model extracts textual and acoustic features using a multi-level multi-head attention module and fuses them into a final shared representation for trauma activity classification. We evaluated the proposed architecture on 75 actual trauma resuscitation cases collected from a hospital, achieving 71.8% accuracy and a 0.702 F1 score, demonstrating that the architecture is both effective and efficient. These results also show that, compared with previous approaches, using spoken language and environmental audio helps identify activities that are otherwise hard to recognize. We also provide a detailed analysis of the performance and generalization of the proposed multimodal attention network.
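
To make the architecture concrete, the following is a minimal PyTorch sketch of this kind of multimodal attention model. It is an illustration under assumptions, not the paper's implementation: all layer names and sizes, and the specific fusion strategy (per-modality self-attention, cross-modal attention from the text branch over audio frames, mean-pooled concatenation) are hypothetical stand-ins for the multi-level multi-head attention module described above.

# Minimal sketch of a multimodal attention network (MAN) for trauma
# activity classification. All dimensions and the fusion design below are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class MultimodalAttentionNetwork(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=300, audio_dim=128,
                 hidden_dim=256, num_heads=8, num_classes=10):
        super().__init__()
        # Textual branch: word embeddings (e.g., pretrained GloVe) projected
        # to a shared hidden size, followed by multi-head self-attention.
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.text_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        # Acoustic branch: frame-level audio features (e.g., openSMILE-style
        # descriptors) with their own multi-head self-attention.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.audio_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        # Second attention level: text queries attend over audio frames.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        # Shared representation -> trauma activity classifier.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens, audio_feats):
        # tokens: (batch, words); audio_feats: (batch, frames, audio_dim)
        t = self.text_proj(self.embed(tokens))    # (batch, words, hidden)
        a = self.audio_proj(audio_feats)          # (batch, frames, hidden)
        t, _ = self.text_attn(t, t, t)            # intra-modal attention
        a, _ = self.audio_attn(a, a, a)
        x, _ = self.cross_attn(t, a, a)           # cross-modal attention
        # Pool each branch and concatenate into the shared representation.
        shared = torch.cat([x.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.classifier(shared)            # activity logits

# Usage with random inputs (4 cases, 50 words, 100 audio frames):
model = MultimodalAttentionNetwork()
logits = model(torch.randint(0, 10000, (4, 50)), torch.randn(4, 100, 128))
print(logits.shape)  # torch.Size([4, 10])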
