Distinctive Phonetic Features Modeling and Extraction Using Deep Neural Networks

Feature extraction is a critical stage of digital speech processing (DSP) systems, since the quality of the extracted features provides the foundation on which all subsequent stages stand. Distinctive phonetic features (DPFs) are among the most representative features of speech signals: they provide an abstract description of the places and manners of articulation of a language's phonemes, and each DPF element of a phoneme reflects unique articulatory information about it. There is therefore a need to investigate each DPF element individually in order to reach a deeper understanding and to build a descriptive model that respects the uniqueness of each element. This paper tackles the problem of DPF modeling and extraction for Modern Standard Arabic. Motivated by the remarkable success of deep neural networks (DNNs) initialized with deep belief networks (DBNs) in DSP applications, and by their ability to extract highly representative features from raw data, we exploit their modeling power to investigate and model the DPF elements. The DNN models are compared with classical multilayer perceptron (MLP) models, and the representativeness of several acoustic cues for different DPF elements is also measured. The paper formalizes the modeling of each DPF element as a binary classification problem. Because the DPF elements are highly imbalanced, evaluating model quality is not straightforward, so the paper identifies evaluation measures appropriate to this imbalance. After each element is modeled individually, two top-level DPF extractors are designed: one MLP-based and one DNN-based. The results demonstrate the quality of the DNN models and their superiority over the MLPs, with accuracies of 89.0% and 86.7%, respectively.
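To make the per-element framing concrete, the sketch below trains a single binary classifier for one hypothetical DPF element and scores it with imbalance-aware measures (balanced accuracy, F1, confusion matrix) rather than plain accuracy. It is a minimal illustration only, not the paper's implementation: the synthetic features, the choice of element, and the roughly 15% positive rate are assumptions, scikit-learn's MLPClassifier stands in for the paper's MLP baseline, and the DBN pre-training used for the DNN models is not reproduced here.

```python
# Minimal sketch (not the paper's implementation): one DPF element, e.g. "voiced",
# framed as a binary classification task over per-frame acoustic features, and
# evaluated with measures suited to class imbalance. Real MFCC extraction, the
# DBN pre-training step, and the Arabic speech data are assumed/omitted; synthetic
# features stand in for them.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, confusion_matrix

rng = np.random.default_rng(0)

# Stand-in frame-level features (e.g. 39-dim MFCC + deltas) and an imbalanced
# binary DPF label (~15% positive frames, mimicking a rare element).
n_frames, n_dims = 5000, 39
X = rng.normal(size=(n_frames, n_dims))
y = (rng.random(n_frames) < 0.15).astype(int)
X[y == 1] += 0.5  # give the positive class a slight shift so it is learnable

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One classifier per DPF element; scikit-learn's MLPClassifier plays the role of
# the MLP baseline (the paper's DNNs additionally use DBN-style pre-training).
clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=200, random_state=0)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Plain accuracy is misleading under imbalance, so report per-class-aware measures.
print("balanced accuracy:", balanced_accuracy_score(y_te, y_hat))
print("F1 (positive DPF element):", f1_score(y_te, y_hat))
print("confusion matrix:\n", confusion_matrix(y_te, y_hat))
```

In this framing, a top-level extractor would simply run one such per-element classifier for every DPF element and concatenate the binary decisions into the phoneme's DPF vector; the architecture of each classifier (MLP vs. DBN-initialized DNN) is what the paper compares.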
