A Data-Driven Approach to Audiovisual Speech Mapping

The use of visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics from temporal visual information alone, without relying on linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various multilayer perceptron (MLP) configurations and dataset arrangements are evaluated to identify the best-performing setup, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
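To make the described pipeline concrete, the following minimal sketch illustrates the general idea of mapping a window of prior 2D-DCT visual features to a log filterbank audio frame with an MLP. It is not the authors' exact setup: the use of scipy's DCT, scikit-learn's MLPRegressor, and all frame counts, region sizes, context lengths, and coefficient/filterbank dimensions are illustrative assumptions, and the data here are random stand-ins for real aligned audiovisual frames.

```python
import numpy as np
from scipy.fft import dct
from sklearn.neural_network import MLPRegressor

def dct2_features(mouth_roi, n_coeffs=50):
    """2D-DCT of a grayscale mouth region; keep the low-order coefficients."""
    d = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return d.ravel()[:n_coeffs]

# --- Toy aligned data (random stand-ins for real corpus frames) ---
rng = np.random.default_rng(0)
n_frames, context = 200, 5                        # 5 prior visual frames per target (assumed)
rois = rng.random((n_frames, 32, 48))             # fake grayscale mouth regions
vis = np.array([dct2_features(r) for r in rois])  # 2D-DCT visual features
aud = rng.normal(size=(n_frames, 23))             # 23-channel log filterbank targets (assumed)

# Input: `context` consecutive visual feature vectors, flattened.
# Target: the log filterbank audio frame at the end of that window.
X = np.stack([vis[i - context:i].ravel() for i in range(context, n_frames)])
y = aud[context:]

mlp = MLPRegressor(hidden_layer_sizes=(300,), max_iter=300, random_state=0)
mlp.fit(X, y)
est = mlp.predict(X[:1])   # estimated audio (log filterbank) frame
```

In practice the hidden layer size, the number of prior visual frames, and the feature dimensionalities are the configuration choices the paper varies to find the best mapping.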
