Unsupervised word discovery from speech using automatic segmentation into syllable-like units

This paper presents a syllable-based approach to unsupervised pattern discovery from speech. By first segmenting speech into syllable-like units, the system limits potential word onsets and offsets to a finite number of candidate locations. The resulting syllable tokens are then described using a set of features and clustered into a finite number of syllable classes. Finally, recurring syllable sequences or individual classes are treated as word candidates. The feasibility of the approach is investigated on spontaneous American English and Tsonga speech samples, with promising results. We also present a new, simple oscillator-based algorithm for efficient unsupervised syllabic segmentation.
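
The final stage of the pipeline, treating recurring syllable sequences or individual classes as word candidates, can be illustrated with a minimal sketch. The abstract does not say how recurrence is detected, so plain n-gram counting over syllable-class labels is an assumption here, not the paper's method. The function name `recurring_ngrams`, the parameters `max_n` and `min_count`, and the toy label sequences are all hypothetical.

```python
from collections import Counter


def recurring_ngrams(label_seqs, max_n=3, min_count=2):
    """Count n-grams of syllable-class labels and keep those that recur.

    label_seqs: iterable of sequences of syllable-class IDs, one per utterance
                (assumed to come from an earlier clustering step).
    max_n:      longest syllable sequence considered (hypothetical parameter).
    min_count:  minimum number of occurrences to count as a word candidate.
    """
    counts = Counter()
    for seq in label_seqs:
        for n in range(1, max_n + 1):
            # Slide a window of n syllables over the utterance.
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical utterances, each a sequence of syllable-class IDs.
utterances = [[3, 7, 7, 1], [5, 3, 7, 2], [3, 7, 9]]
candidates = recurring_ngrams(utterances, max_n=2, min_count=3)
print(candidates)
```

Because segmentation fixes the candidate boundaries in advance, the search space is the set of contiguous label windows, which is what keeps a counting approach like this tractable.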
