Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection
In this paper we examine the task of automatic speech detection without microphones, using an overhead camera and wearable accelerometers. For this purpose, we propose extracting hand-crafted appearance and optical flow features from the video modality, and time-domain features from the accelerometer data. We evaluate the performance of each modality separately on a large dataset of over 25 hours of standing conversations among multiple individuals. Finally, we show that applying a multimodal late fusion technique yields a performance boost in most cases.
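As a rough illustration of the late fusion idea mentioned above, the sketch below combines per-modality speech probabilities at the score level by weighted averaging. The probability values, equal weights, and 0.5 threshold are illustrative assumptions, not the paper's actual classifiers or fusion weights.

```python
import numpy as np

# Hypothetical per-modality speech probabilities for 5 time windows.
# In practice each modality (appearance, optical flow, accelerometer)
# would come from its own trained classifier.
p_appearance = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
p_flow = np.array([0.8, 0.3, 0.6, 0.5, 0.9])
p_accel = np.array([0.7, 0.4, 0.8, 0.3, 0.6])

def late_fusion(probs, weights=None):
    """Weighted average of per-modality probabilities (score-level fusion)."""
    probs = np.stack(probs)  # shape: (n_modalities, n_windows)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))  # equal weights
    return weights @ probs  # fused probability per window

fused = late_fusion([p_appearance, p_flow, p_accel])
speaking = fused >= 0.5  # binary speech / no-speech decision per window
```

A design note: late (score-level) fusion keeps each modality's classifier independent, so a modality can be dropped or re-weighted without retraining the others, which is convenient when one sensor stream is missing or unreliable.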