Phoneme Boundary Detection using Deep Bidirectional LSTMs

In this paper we investigate the automatic detection of phoneme boundaries in audio recordings using deep bidirectional LSTMs. The work is motivated by the BULB project, which aims to support linguists in documenting unwritten languages; detecting phoneme boundaries automatically in recordings of a new language is one of the project's technical requirements. For our first experiments with LSTMs on this task, we worked on TIMIT and BUCKEYE and measured performance using accuracy, precision, recall and F-measure. We then applied the trained networks cross-lingually to Basaa, one of the Bantu languages addressed in BULB. With the LSTMs trained for this paper, we achieve a phoneme segmentation performance on TIMIT that, to the best of our knowledge, outperforms the systems reported in the literature so far.
