Sequence segmentation using joint RNN and structured prediction models

We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs serve as feature functions for the structured model. The overall model is trained with a structured loss function that can be tailored to the given segmentation task. We demonstrate the effectiveness of our method by applying it to two tasks commonly used in phonetic studies: word segmentation and voice onset time segmentation. Results suggest that the proposed model is superior to previous methods, obtaining state-of-the-art results on the tested datasets.
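The abstract does not give implementation details, but the core idea — per-frame RNN outputs used as feature functions inside a structured model that searches over segmentations with dynamic programming — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the frame features stand in for RNN outputs, the linear segment scorer and all names (`segment_score`, `best_segmentation`, `max_len`) are invented for this sketch.

```python
import numpy as np

def segment_score(feats, w, start, end):
    # Linear feature function over a candidate segment [start, end):
    # here, a stand-in for scoring pooled RNN outputs with weights w.
    return w @ feats[start:end].mean(axis=0)

def best_segmentation(feats, w, max_len=5):
    # Dynamic program over segment end positions: best[t] is the score
    # of the best segmentation of frames [0, t). This plays the role of
    # inference in the structured prediction module.
    T = len(feats)
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            sc = best[s] + segment_score(feats, w, s, t)
            if sc > best[t]:
                best[t], back[t] = sc, s
    # Backtrack to recover the segment boundaries.
    bounds, t = [], T
    while t > 0:
        bounds.append(t)
        t = back[t]
    return sorted(bounds), best[T]

# Toy example: 6 frames of 2-dimensional "RNN" features.
feats = np.arange(12, dtype=float).reshape(6, 2)
w = np.array([1.0, -1.0])
bounds, score = best_segmentation(feats, w, max_len=3)
print(bounds, score)
```

Joint training would backpropagate a structured loss (e.g. a structured hinge loss comparing the best-scoring segmentation against the gold one) through the scorer and into the RNN producing `feats`; the sketch above covers only the inference step.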
