Paper Information - A Data Driven Approach to Audiovisual Speech Mapping

A Data Driven Approach to Audiovisual Speech Mapping

The use of visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics using only temporal visual information, without considering linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various MLP configurations and datasets are evaluated to identify the best results. These show that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
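The visual side of the pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the number of retained DCT coefficients (`keep`), the context length, and the choice to pad the start of a sequence by repeating the first frame are all assumptions made for the example. It computes an orthonormal 2D-DCT of a (hypothetical) lip-region image, keeps the low-frequency coefficients, and concatenates the features of several consecutive frames into one input vector of the kind an MLP could map to a log filterbank frame.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(img):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def visual_feature(img, keep=3):
    """Keep the top-left keep x keep block of low-frequency coefficients.

    The block size is an assumption; the abstract does not state how many
    coefficients are retained.
    """
    coeffs = dct_2d(img)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


def stack_prior_frames(features, t, context):
    """Concatenate the features of frames t-context+1 .. t into one vector.

    Frames before the start of the sequence are replaced by frame 0
    (an assumed padding rule, chosen only to keep the vector length fixed).
    """
    window = []
    for j in range(t - context + 1, t + 1):
        window.extend(features[max(j, 0)])
    return window
```

In use, each video frame would be passed through `visual_feature`, and `stack_prior_frames(features, t, context)` would give the MLP input whose training target is the audio feature vector for frame `t`. Whether the window should include frame `t` itself or only strictly earlier frames is a design choice the abstract leaves open.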
