Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings

We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust ASR system. This work is grounded in a very low-resource language documentation scenario, where only a few minutes of recordings have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target-language speech, can be used for spoken term detection with better overall performance than a dynamic time warping approach. In addition, we show that representing phoneme recognition ambiguity in a graph structure can further boost recall while maintaining high precision in this low-resource spoken term detection task.
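
To make the comparison concrete, here is a minimal sketch of a DTW-based query-by-example detector of the kind the abstract uses as a baseline. This is not the paper's exact system: the feature type (e.g. MFCC frames), the cosine frame distance, and the `hop` and `threshold` parameters are illustrative assumptions.

```python
import numpy as np

def dtw_cost(query, segment):
    """Normalized DTW alignment cost between two (time, dim) feature
    sequences (e.g. MFCC frames). Lower cost means more similar."""
    # Pairwise cosine distances between frames.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    s = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ s.T
    n, m = len(query), len(segment)
    # Standard DTW recursion with unit insertion/deletion/match steps.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Normalize by path length so costs are comparable across lengths.
    return acc[n, m] / (n + m)

def detect(query_feats, utterance_feats, hop=5, threshold=0.35):
    """Slide the query over an utterance and report low-cost windows."""
    hits = []
    n = len(query_feats)
    for start in range(0, len(utterance_feats) - n + 1, hop):
        window = utterance_feats[start:start + 2 * n]  # allow slower speech
        cost = dtw_cost(query_feats, window)
        if cost < threshold:
            hits.append((start, cost))
    return hits
```

The graph-based variant instead searches the phone recognizer's output while keeping competing phoneme hypotheses rather than only the 1-best string. Below is a minimal sketch under the assumption that this output is summarized as a confusion-network-like sequence of slots, each mapping candidate phonemes to posterior probabilities; a real lattice or confusion network would also handle insertions and deletions (e.g. via epsilon arcs), which this sketch omits.

```python
import math

def confusion_net_score(query, net):
    """Best log-probability of reading `query` through consecutive slots.

    query: list of phonemes, e.g. ["t", "a", "m"].
    net:   confusion-network-like output: a list of slots, each a dict
           mapping candidate phonemes to their posterior probabilities.
    Returns -inf if the query matches nowhere.
    """
    best = -math.inf
    for start in range(len(net) - len(query) + 1):
        score = 0.0
        for k, ph in enumerate(query):
            p = net[start + k].get(ph, 0.0)
            if p <= 0.0:
                break  # phoneme absent from this slot: no match here
            score += math.log(p)
        else:
            best = max(best, score)
    return best

# A query can still match when the right phoneme is only a lower-ranked
# hypothesis in a slot, which is what boosts recall over 1-best decoding:
net = [{"t": 0.6, "d": 0.4}, {"a": 0.9, "e": 0.1}, {"n": 0.7, "m": 0.3}]
print(confusion_net_score(["t", "a", "m"], net))  # matches via "m" at 0.3
```

The example illustrates the recall/precision trade-off the abstract describes: the 1-best string here is "t a n", which would miss the query "t a m", while the graph keeps the match alive at a lower score, and a score threshold preserves precision.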
