Unsupervised discovery of phoneme boundaries in multi-speaker continuous speech

Children rapidly learn the phoneme inventory of their native language. Computational approaches to learning phoneme boundaries from speech data do not yet match human performance. We present an algorithm that operates on data qualitatively similar to what children receive: natural language utterances from multiple speakers. The algorithm is unsupervised and discovers phoneme boundary positions in continuous speech, drawing inspiration from the word and text segmentation literature. To demonstrate its efficacy on speech data, we present empirical results on the TIMIT data set, where the method achieves F-measure scores between 0.68 and 0.73 for locating phoneme boundary positions.
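For readers unfamiliar with how boundary-detection scores of this kind are typically computed, the following is a minimal sketch of an F-measure over boundary positions. It is an illustration, not the evaluation procedure used in this work: the function name `boundary_f_measure`, the greedy one-to-one matching, and the 20 ms tolerance window are all assumptions introduced here, since the abstract does not state the matching criterion.

```python
def boundary_f_measure(predicted, reference, tolerance=0.02):
    """Return (precision, recall, F) for boundary times given in seconds.

    Assumption: a predicted boundary counts as a hit if it lies within
    `tolerance` seconds of a reference boundary that has not already been
    matched. The 20 ms default is illustrative, not taken from the paper.
    """
    predicted = sorted(predicted)
    reference = sorted(reference)
    matched = 0
    i = j = 0
    # Walk both sorted lists, pairing each boundary at most once.
    while i < len(predicted) and j < len(reference):
        diff = predicted[i] - reference[j]
        if abs(diff) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # predicted boundary is too early to match anything later
        else:
            j += 1  # reference boundary was missed
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical boundary times in seconds, for illustration only.
p, r, f = boundary_f_measure([0.10, 0.25, 0.50, 0.90], [0.11, 0.30, 0.51])
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this scheme, precision penalizes spurious boundaries and recall penalizes missed ones, so over-segmenting cannot inflate the F-measure on its own.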
