Speech2Face: Learning the Face Behind a Voice
Tae-Hyun Oh | Wojciech Matusik | William T. Freeman | Tali Dekel | Changil Kim | Inbar Mosseri | Michael Rubinstein
[1] Tae-Hyun Oh,et al. On Learning Associations of Faces and Voices , 2018, ACCV.
[2] Najim Dehak,et al. Age Estimation in Short Speech Utterances Based on LSTM Recurrent Neural Networks , 2018, IEEE Access.
[3] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[4] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[5] V. D. Sa. Minimizing Disagreement for Self-Supervised Classification , 1994 .
[6] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[7] Christian A. Müller,et al. Automatic speaker age and gender recognition in the car for tailoring dialog and mobile services , 2010, INTERSPEECH.
[8] P. Denes. The Speech Chain , 1963 .
[9] Ira Kemelmacher-Shlizerman,et al. Synthesizing Obama , 2017, ACM Trans. Graph.
[10] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[11] John H L Hansen,et al. Speaker height estimation from speech: Fusing spectral regression and statistical acoustic models. , 2015, The Journal of the Acoustical Society of America.
[12] Kevin Wilson,et al. Looking to listen at the cocktail party , 2018, ACM Trans. Graph.
[13] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[14] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[15] Yu Zhang,et al. Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data , 2017, NIPS.
[16] Andrew Zisserman,et al. Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[17] Radu Horaud,et al. Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).
[18] Andrew Zisserman,et al. Emotion Recognition in Speech using Cross-Modal Transfer in the Wild , 2018, ACM Multimedia.
[19] Andrew Zisserman,et al. X2Face: A network for controlling face generation by using images, audio, and pose codes , 2018, ECCV.
[20] E. Vatikiotis-Bateson,et al. 'Putting the Face to the Voice': Matching Identity across Modality , 2003, Current Biology.
[21] Jordi Torres,et al. Wav2Pix: Speech-conditioned Face Generation Using Generative Adversarial Networks , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[22] Geoffrey E. Hinton,et al. Distilling the Knowledge in a Neural Network , 2015, ArXiv.
[23] Phillip Isola. The Discovery of perceptual structure from visual co-occurrences in space and time , 2015 .
[24] Jean Charles Bazin,et al. Suggesting Sounds for Images from Video Collections , 2016, ECCV Workshops.
[25] Andrew Zisserman,et al. Learnable PINs: Cross-Modal Embeddings for Person Identity , 2018, ECCV.
[26] Tae-Hyun Oh,et al. Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[27] Jaakko Lehtinen,et al. Audio-driven facial animation by joint end-to-end learning of pose and emotion , 2017, ACM Trans. Graph.
[28] Andrew Owens,et al. Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning , 2017, International Journal of Computer Vision.
[29] Honglak Lee,et al. Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.
[30] Ming-Yu Liu,et al. Coupled Generative Adversarial Networks , 2016, NIPS.
[31] William T. Freeman,et al. Synthesizing Normalized Faces from Facial Identity Features , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Malcolm Slaney,et al. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers , 2017, ArXiv.
[33] Bhiksha Raj,et al. Disjoint Mapping Network for Cross-modal Matching of Voices and Faces , 2018, ICLR.
[34] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[35] Joon Son Chung,et al. Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Naoyuki Kanda,et al. Face-Voice Matching using Cross-modal Embeddings , 2018, ACM Multimedia.
[37] Joon Son Chung,et al. VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.
[38] Andrew Zisserman,et al. Deep Face Recognition , 2015, BMVC.
[39] Davis E. King,et al. Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res.
[40] H. M. J. Smith,et al. Matching novel face and voice identity using static and dynamic facial images , 2016, Attention, perception & psychophysics.
[41] John R. Smith,et al. Diversity in Faces , 2019, ArXiv.
[42] Tae-Hyun Oh,et al. Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[43] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[44] Carlos Busso,et al. Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks , 2018, IEEE Transactions on Affective Computing.
[45] Ira Kemelmacher-Shlizerman,et al. Audio to Body Dynamics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[46] Antonio Torralba,et al. Learning Aligned Cross-Modal Representations from Weakly Aligned Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Yisong Yue,et al. A deep learning approach for generalized speech animation , 2017, ACM Trans. Graph.
[48] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .
[50] Hugo Van hamme,et al. Speaker age estimation and gender detection based on supervised Non-Negative Matrix Factorization , 2011, 2011 IEEE Workshop on Biometric Measurements and Systems for Security and Medical Applications (BIOMS).
[51] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.