Probabilistic Modeling of Speech in Spectral Domain using Maximum Likelihood Estimation

The performance of many speech processing algorithms depends on modeling speech signals with appropriate probability distributions. Distributions proposed in the literature include the Gamma, Gaussian, Generalized Gaussian, and Laplace distributions, as well as multivariate Gaussian and Laplace distributions. These have been used to model speech segments of different lengths, typically below 200 ms, in different domains. In this paper, we fit Laplace and Gaussian distributions to the short-time Fourier transform coefficients of speech at high spectral resolution (segment length >500 ms) and at low spectral resolution (segment length <10 ms). Both distributions were fitted using maximum-likelihood estimation. We found that the short-time Fourier transform coefficients of speech at high spectral resolution can be modeled using a Laplace distribution. At low spectral resolution, neither the Laplace nor the Gaussian distribution provided a good fit. Spectral-domain modeling of speech at different spectral resolutions is useful for understanding the perceptual stability of hearing, which is necessary for the design of digital hearing aids.
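To make the fitting step concrete, the following is a minimal sketch of maximum-likelihood fitting for the two candidate distributions; it is not the authors' code. It uses the standard closed-form estimators: sample mean and (biased) standard deviation for the Gaussian, and sample median and mean absolute deviation from the median for the Laplace. The function names and the synthetic `samples` array are assumptions made for illustration; in practice the input would be, for example, the real parts of short-time Fourier transform coefficients, which the abstract does not specify in detail.

```python
import numpy as np


def fit_gaussian_mle(x):
    """Gaussian MLE: sample mean and biased (1/N) standard deviation."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = np.sqrt(np.mean((x - mu) ** 2))
    return mu, sigma


def fit_laplace_mle(x):
    """Laplace MLE: sample median and mean absolute deviation from it."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    return mu, b


def gaussian_loglik(x, mu, sigma):
    """Total log-likelihood of x under N(mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2.0 * sigma ** 2))


def laplace_loglik(x, mu, b):
    """Total log-likelihood of x under Laplace(mu, b)."""
    x = np.asarray(x, dtype=float)
    return np.sum(-np.log(2.0 * b) - np.abs(x - mu) / b)


# Synthetic stand-in for spectral coefficients (assumption: not real speech).
rng = np.random.default_rng(0)
samples = rng.laplace(loc=0.0, scale=1.0, size=20000)

g_mu, g_sigma = fit_gaussian_mle(samples)
l_mu, l_b = fit_laplace_mle(samples)

ll_gauss = gaussian_loglik(samples, g_mu, g_sigma)
ll_laplace = laplace_loglik(samples, l_mu, l_b)

# The distribution with the higher log-likelihood fits the data better.
better = "Laplace" if ll_laplace > ll_gauss else "Gaussian"
print(f"Gaussian log-likelihood: {ll_gauss:.1f}")
print(f"Laplace  log-likelihood: {ll_laplace:.1f}")
print(f"Better fit: {better}")
```

Comparing log-likelihoods only ranks the two candidates against each other. It does not show that either one fits well in absolute terms, which is the situation the abstract reports for low spectral resolution; a separate goodness-of-fit check would be needed for that.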
