Vision-based Active Speaker Detection in Multiparty Interaction
[1] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[2] Harriet J. Nock, et al. Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study, 2003, CIVR.
[3] Sudeep Sarkar, et al. Exploring Co-Occurence Between Speech and Body Movement for Audio-Guided Video Localization, 2008, IEEE Transactions on Circuits and Systems for Video Technology.
[4] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[5] Jingwen Dai, et al. Deep Multimodal Speaker Naming, 2015, ACM Multimedia.
[6] Paul A. Viola, et al. Boosting-Based Multimodal Speaker Detection for Distributed Meetings, 2006, 2006 IEEE Workshop on Multimedia Signal Processing.
[7] Akihiro Sugimoto, et al. Look who's talking: visual identification of the active speaker in multi-party human-robot interaction, 2016, ASSP4MI '16.
[8] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.
[9] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[10] Chuohao Yeo, et al. Visual speaker localization aided by acoustic models, 2009, MM '09.
[11] Malcolm Slaney, et al. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks, 2000, NIPS.
[12] Sileye O. Ba, et al. Speech/Non-Speech Detection in Meetings from Automatically Extracted low Resolution Visual Features, 2010, ICASSP.
[13] Murat Kunt, et al. Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection, 2007, Journal of NeuroEngineering and Rehabilitation.
[14] Chuan Wang, et al. Look, Listen and Learn - A Multimodal LSTM for Speaker Identification, 2016, AAAI.
[15] Gabriel Skantze, et al. IrisTK: a statechart-based toolkit for multi-party face-to-face interaction, 2012, ICMI '12.
[16] Nicholas W. D. Evans, et al. Speaker Diarization: A Review of Recent Research, 2010, IEEE Transactions on Audio, Speech, and Language Processing.
[17] Trevor Darrell, et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation, 2000, NIPS.
[18] Jonas Beskow, et al. A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction, 2016, LREC.