SPARTA: Speaker Profiling for ARabic TAlk

This paper proposes a novel approach to the automatic estimation of three speaker traits from Arabic speech: gender, emotion, and dialect. Having shown promising results on various text classification tasks, the multi-task learning (MTL) approach is applied here to Arabic speech classification. The dataset was assembled from six publicly available datasets. First, the datasets were cleaned and divided into train, development, and test sets (released to the public), and a benchmark was set for each task and dataset throughout the paper. Then, three network architectures were explored: Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Fully-Connected Neural Network (FCNN), over five feature types: two raw features (MFCC and MEL) and three pre-trained vectors (i-vectors, d-vectors, and x-vectors). The LSTM and CNN networks were trained on the raw features (MFCC and MEL), while the FCNN was trained on the pre-trained vectors; the hyper-parameters of each network were tuned to obtain the best results for each dataset and task. MTL was evaluated against the single-task learning (STL) approach across the three tasks and six datasets, with MTL and the pre-trained vectors almost consistently outperforming STL. All the data and pre-trained models used in this paper are publicly available.
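To make the MTL setup concrete, here is a minimal sketch (not the authors' exact architecture) of the core idea: a shared trunk maps a pre-trained speaker embedding (e.g. an x-vector) to a hidden representation, and three task-specific heads predict gender, emotion, and dialect from that shared representation. All dimensions, layer sizes, and class counts below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_IN = 512       # x-vector size (assumed)
DIM_HID = 128      # shared hidden size (assumed)
N_CLASSES = {"gender": 2, "emotion": 5, "dialect": 5}  # illustrative counts

# Shared trunk parameters, reused by every task
W_shared = rng.normal(0, 0.01, (DIM_IN, DIM_HID))
b_shared = np.zeros(DIM_HID)

# One linear classification head per task
heads = {t: (rng.normal(0, 0.01, (DIM_HID, k)), np.zeros(k))
         for t, k in N_CLASSES.items()}

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Return per-task class probabilities for a batch of speaker vectors."""
    h = np.maximum(0.0, x @ W_shared + b_shared)  # shared ReLU layer
    return {t: softmax(h @ W + b) for t, (W, b) in heads.items()}

batch = rng.normal(size=(4, DIM_IN))  # 4 fake utterance embeddings
probs = forward(batch)                # dict: task -> (4, n_classes) probs
```

In training, the losses of the three heads would be summed (possibly weighted) and backpropagated through the shared trunk, which is what lets each task act as a regularizer for the others.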
