Detection of Lexical Stress Errors in Non-native (L2) English with Data Augmentation and Attention

This paper describes two novel, complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In classical approaches, audio features are usually extracted from fixed regions of speech, such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because examples of incorrect stress patterns are scarce. To solve this problem, we propose augmenting the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic speakers.
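
As a rough illustration of how attention can replace a fixed region such as the syllable nucleus, the sketch below pools frame-level feature vectors into a single syllable-level vector using softmax attention weights. It is not the model proposed in the paper: the function names, the toy feature values, and the hand-picked `query` vector are all assumptions made for illustration, and in a trained model the query (or an equivalent scoring network) would be learned from data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse frame-level vectors into one syllable-level vector.

    frames: list of equal-length feature vectors, one per audio frame.
    query:  vector of the same length (hypothetical; learned in practice).
    Returns the pooled vector and the attention weight of each frame.
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every frame.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / scale
        for frame in frames
    ]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of frames: frames with higher scores contribute more.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy syllable with three frames of two made-up features each.
frames = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]
syllable_vec, weights = attention_pool(frames, query)
```

In this toy example the middle frame receives the largest weight, so the pooled vector is drawn toward it. The point of the design is that the weighting is computed from the data rather than fixed in advance to a particular region of the syllable.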
