Text-Independent Phoneme Segmentation via Learning Critical Acoustic Change Points

Conventional methods for automatic text-independent phoneme segmentation detect phoneme boundaries by computing acoustic changes along the speech signal and then picking peaks according to user-defined rules. This paper instead presents a learning-based method in which phoneme boundaries are treated as critical points in the acoustic change context of the speech signal. First, we apply metric learning to the computation of acoustic changes so that the changes at phoneme boundaries become more discriminative. Then, a latent-dynamic conditional random field is used to model the acoustic change context and detect phoneme boundaries. Experiments show that our method outperforms the rule-based methods reported in previous work.
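To make the rule-based baseline concrete, the following is a minimal sketch of acoustic change computation followed by threshold-based peak picking. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy feature frames, the threshold, and the hand-set matrix `M` are all hypothetical. In the method summarized above, the metric would be learned rather than fixed, and the peak-picking rule would be replaced by a latent-dynamic conditional random field.

```python
import numpy as np

def acoustic_change(frames, M=None):
    """Squared distance between consecutive feature frames under metric M.

    frames: array of shape (T, D), one feature vector per frame.
    M: (D, D) positive semidefinite matrix; identity (Euclidean) if None.
    Returns an array of length T - 1, where entry t compares frame t and t + 1.
    """
    frames = np.asarray(frames, dtype=float)
    diffs = frames[1:] - frames[:-1]
    if M is None:
        M = np.eye(frames.shape[1])
    # d_t = diff_t^T M diff_t for every t
    return np.einsum("ti,ij,tj->t", diffs, M, diffs)

def pick_peaks(change, threshold):
    """Indices of local maxima of the change curve that exceed threshold."""
    peaks = []
    for t in range(1, len(change) - 1):
        if (change[t] > threshold
                and change[t] > change[t - 1]
                and change[t] >= change[t + 1]):
            peaks.append(t)
    return peaks

# Toy signal (hypothetical): three steady segments of five frames each.
frames = np.array([[0.0, 0.0]] * 5 + [[1.0, 0.0]] * 5 + [[1.0, 3.0]] * 5)

euclidean_peaks = pick_peaks(acoustic_change(frames), threshold=0.5)

# A hand-set metric that weights only the first feature dimension.
# A learned metric would take its place in a learning-based method.
M = np.diag([4.0, 0.0])
weighted_peaks = pick_peaks(acoustic_change(frames, M), threshold=0.5)

print(euclidean_peaks)  # [4, 9]
print(weighted_peaks)   # [4]
```

A peak at index `t` marks a candidate boundary between frames `t` and `t + 1`. Changing `M` changes which acoustic differences register as boundaries, which is why the choice of metric matters for making the changes at true phoneme boundaries stand out.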
