TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis
Jing-Xuan Zhang | Korin Richmond | Zhen-Hua Ling | Lirong Dai