Automated Classification of Vowel Category and Speaker Type in the High-Frequency Spectrum

The high-frequency region of vowel signals (above the third formant, F3) has received little research attention. Recent evidence, however, has documented the perceptual utility of information above the frequency bandwidth traditionally considered to contain the important cues for speech and speaker recognition. The purpose of this study was to determine whether high-pass filtered vowels could be separated by vowel category and speaker type in a supervised learning framework. Mel-frequency cepstral coefficients (MFCCs) were extracted from productions of six vowel categories by two male, two female, and two child speakers. Results revealed that the filtered vowels were well separated by both vowel category and speaker type using MFCCs computed from the high-frequency spectrum. This demonstrates that the high-frequency region carries information useful for automated classification, and this is the first study to report findings of this nature in a supervised learning framework.
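The pipeline the abstract describes can be illustrated with a minimal numpy-only sketch: high-pass filter a signal to remove the region at and below F3, extract MFCCs from the remaining high-frequency spectrum, and classify in a supervised fashion. Everything here is an illustrative assumption rather than the study's actual method: the signals are synthetic vowel-like harmonic tones (not recorded vowels), the 3 kHz cutoff, filterbank range, and frame parameters are placeholders, and a leave-one-out nearest-centroid classifier stands in for whatever learner the study used.

```python
import numpy as np

SR = 16000
rng = np.random.default_rng(0)

def highpass(x, sr, cutoff):
    """Crude high-pass filter: zero all FFT bins below the cutoff."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    X[freqs < cutoff] = 0.0
    return np.fft.irfft(X, len(x))

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr, fmin, fmax):
    """Triangular mel filterbank restricted to [fmin, fmax]."""
    hz = inv_mel(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(x, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=12, fmin=3000.0):
    """Average MFCC vector over frames; filterbank starts at fmin so the
    features describe only the high-frequency spectrum."""
    frames = np.array([x[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(x) - n_fft + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fb = mel_filterbank(n_filters, n_fft, sr, fmin, sr / 2)
    log_e = np.log(spec @ fb.T + 1e-10)
    # DCT-II over the filterbank axis (standard MFCC decorrelation step)
    basis = np.cos(np.pi / n_filters
                   * (np.arange(n_filters) + 0.5)[None, :]
                   * np.arange(n_coeffs)[:, None])
    return (log_e @ basis.T).mean(axis=0)

def synth_vowel(f0, peak_hz, dur=0.25):
    """Vowel-like harmonic tone whose spectral envelope peaks near peak_hz,
    a stand-in for category-specific high-frequency structure."""
    t = np.arange(int(SR * dur)) / SR
    x = np.zeros_like(t)
    for h in range(1, SR // 2 // f0):
        f = h * f0
        amp = np.exp(-0.5 * ((f - peak_hz) / 600.0) ** 2) + 0.02
        x += amp * np.sin(2 * np.pi * f * t + rng.uniform(0, 2 * np.pi))
    return x / np.max(np.abs(x))

# Two hypothetical "vowel categories" differing only above 3 kHz,
# each produced at two "speaker" fundamental frequencies.
data, labels = [], []
for peak, label in [(4000.0, 0), (6000.0, 1)]:
    for f0 in (120, 220):
        for _ in range(5):
            x = synth_vowel(f0 + int(rng.integers(-5, 5)), peak)
            x = highpass(x, SR, 3000.0)   # discard the F3-and-below region
            data.append(mfcc(x, SR))
            labels.append(label)
X, y = np.array(data), np.array(labels)

# Leave-one-out nearest-centroid classification on the MFCC vectors.
correct = 0
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    c0 = X[mask & (y == 0)].mean(axis=0)
    c1 = X[mask & (y == 1)].mean(axis=0)
    pred = 0 if np.linalg.norm(X[i] - c0) < np.linalg.norm(X[i] - c1) else 1
    correct += int(pred == y[i])
accuracy = correct / len(X)
```

Because the two synthetic categories have clearly separated high-frequency envelopes, even this trivial classifier separates them from the filtered spectrum alone; the point is only to make the filter-then-featurize-then-classify structure concrete.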
