Age-VOX-Celeb: Multi-Modal Corpus for Facial and Speech Estimation

Estimating a speaker’s age from speech is more challenging than estimating it from the face because few public corpora are available for the speech task. To tackle this problem, we construct a new audio-visual age corpus named AgeVoxCeleb by annotating VoxCeleb2 videos with age labels. AgeVoxCeleb is the first large-scale, balanced, and multi-modal age corpus that contains both video and speech of the same speakers across a wide age range. Using AgeVoxCeleb, we make the following contributions: (i) By comparing state-of-the-art models for each task, we show that a facial age estimation model can outperform a speech age estimation model. (ii) We show that facial age estimation is more robust to mismatch between the training and test sets. (iii) We develop cross-modal transfer learning from facial to speech age estimation, showing that ages estimated by a facial age estimation model can be used to train a speech age estimation model. AgeVoxCeleb will be published at https://github.com/nttcslab-sp/agevoxceleb.
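The cross-modal transfer idea in contribution (iii) can be illustrated with a minimal sketch. It is not the models or training procedure used in this work: the per-frame face predictions are simulated, the speech features are toy values, and a closed-form ridge regressor stands in for a speech age estimation network. All function and variable names are hypothetical.

```python
import numpy as np


def face_pseudo_labels(frame_ages_per_clip):
    """Average per-frame ages from a (hypothetical) face model into one pseudo label per clip."""
    return np.array([float(np.mean(ages)) for ages in frame_ages_per_clip])


def add_bias(features):
    """Append a constant column so the regressor can learn an offset."""
    return np.hstack([features, np.ones((len(features), 1))])


def fit_speech_regressor(speech_features, pseudo_ages, l2=1e-3):
    """Closed-form ridge regression from speech features to face-derived pseudo ages."""
    X = add_bias(speech_features)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ pseudo_ages)


def predict_age(weights, speech_features):
    """Predict ages from speech features with the fitted weights."""
    return add_bias(speech_features) @ weights


# Toy data (assumption): one age-correlated speech feature plus one noise feature.
rng = np.random.default_rng(0)
true_ages = rng.uniform(20, 70, size=200)
speech_features = np.column_stack([
    true_ages / 100 + rng.normal(0, 0.02, size=200),
    rng.normal(size=200),
])

# Simulated face-model output: five noisy per-frame age estimates per clip.
frame_ages = [age + rng.normal(0, 3, size=5) for age in true_ages]

# The speech model is trained on face-derived labels only, never on true ages.
pseudo = face_pseudo_labels(frame_ages)
w = fit_speech_regressor(speech_features, pseudo)

# True ages are used here only to evaluate the speech model.
mae = float(np.mean(np.abs(predict_age(w, speech_features) - true_ages)))
print(f"speech-model MAE against true ages: {mae:.2f} years")
```

The point of the sketch is that the speech regressor never sees ground-truth ages during training. Its only supervision is the age estimated from the face track of the same clip, which is what an audio-visual corpus with paired video and speech makes possible.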
