SylNet: An Adaptable End-to-End Syllable Count Estimator for Speech

Automatic syllable count estimation (SCE) is used in a variety of applications, ranging from speaking rate estimation and detection of social activity from wearable microphones to developmental research concerned with quantifying the speech heard by language-learning children in different environments. Most previous SCE methods have relied on heuristic digital signal processing (DSP), and only a small number of approaches, based on bi-directional long short-term memory (BLSTM) networks, have applied modern machine learning to the SCE task. This letter presents a novel end-to-end method called SylNet for automatic syllable counting from speech, built on recent developments in neural network architectures. We describe how the entire model can be optimized directly to minimize SCE error on the training data without annotations aligned at the syllable level, and how it can be adapted to new languages using limited speech data with known syllable counts. Experiments on several different languages reveal that SylNet generalizes to languages beyond its training data and improves further with adaptation. It also outperforms several previously proposed methods for syllabification, including end-to-end BLSTMs.
