On the Difficulty of Segmenting Words with Attention

Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that, for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed at the task. Models trained to predict words from either phones or speech (i.e., the opposite direction, but the one needed to generalize to new data) yield much worse results, suggesting that attention-based segmentation is useful only in limited scenarios.
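The attention-based segmentation idea referred to above can be sketched roughly as follows. This is a minimal illustration under assumed conventions, not the paper's exact algorithm: given a cross-attention matrix from a trained sequence-to-sequence model, each input frame is hard-assigned to the output token that attends to it most strongly (the argmax rule, one common choice), and a boundary is placed wherever the assignment changes.

```python
import numpy as np

def segment_from_attention(attn):
    """Derive segment boundaries from a cross-attention matrix.

    attn: array of shape (num_output_tokens, num_input_frames)
    holding attention weights from a trained seq2seq model.
    Returns boundary indices, including the start and end frames.
    """
    # Hard-assign each input frame to its argmax output token.
    assignment = np.argmax(attn, axis=0)
    # A boundary falls between frames assigned to different tokens.
    boundaries = [0]
    for t in range(1, len(assignment)):
        if assignment[t] != assignment[t - 1]:
            boundaries.append(t)
    boundaries.append(len(assignment))
    return boundaries

# Toy example: 3 output tokens attending over 6 input frames.
attn = np.array([
    [0.8, 0.7, 0.1, 0.0, 0.0, 0.1],
    [0.1, 0.2, 0.8, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.8],
])
print(segment_from_attention(attn))  # [0, 2, 4, 6]
```

The brittleness discussed in the abstract arises because this procedure is only as good as the attention weights themselves: diffuse or misaligned attention produces spurious or missing boundaries.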
