Text-Independent Phoneme Segmentation Combining EGG and Speech Data

This paper proposes a new approach to text-independent phoneme segmentation at the sampling-point level. The algorithm has two phases. First, the voiced sections of the speech data are detected using the vocal fold vibration information contained in the electroglottograph (EGG) signal; a Hilbert envelope feature is adopted to achieve sampling-point-level detection accuracy. Second, the voiced sections and the remaining sections are treated separately. Each voiced section is divided into several candidate phonemes using the Viterbi algorithm, and adjacent candidate phonemes are then merged based on Hotelling's T-square test. In the remaining sections, unvoiced consonants are distinguished from silence based on a singularity exponent feature. Comparison experiments show that the proposed method outperforms existing methods across a variety of tolerances and is more robust to noise.
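Two of the steps named above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the smoothing window, the relative threshold, and the use of a pooled-covariance two-sample test are all assumptions made for illustration; the abstract does not give the features, thresholds, or merging rule actually used.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import f as f_dist


def voiced_mask_from_egg(egg, fs, smooth_ms=10.0, rel_threshold=0.1):
    """Per-sample voiced mask from the Hilbert envelope of an EGG signal.

    smooth_ms and rel_threshold are illustrative values (assumptions),
    not parameters taken from the paper.
    """
    egg = np.asarray(egg, dtype=float)
    egg = egg - egg.mean()                      # remove DC offset
    envelope = np.abs(hilbert(egg))             # magnitude of the analytic signal
    win = max(1, int(fs * smooth_ms / 1000.0))  # moving-average length in samples
    envelope = np.convolve(envelope, np.ones(win) / win, mode="same")
    return envelope > rel_threshold * envelope.max()


def hotelling_t2(x, y):
    """Two-sample Hotelling's T-square test for equal mean vectors.

    x, y: arrays of shape (n_frames, n_features), one row per feature frame.
    Returns the T-square statistic and the p-value obtained from the
    equivalent F distribution (pooled covariance assumed).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, p = x.shape
    n2 = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)              # handles the single-feature case
    t2 = (n1 * n2) / (n1 + n2) * (diff @ np.linalg.solve(pooled, diff))
    f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    p_value = f_dist.sf(f_stat, p, n1 + n2 - p - 1)
    return float(t2), float(p_value)
```

Under this reading, two adjacent candidate phonemes would be merged when the test fails to reject equal means, that is, when the p-value is above a chosen significance level. That decision rule is an assumption here, not a detail stated in the abstract.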
