Few-Shot Keyword Spotting in Any Language

We introduce a few-shot transfer learning method for keyword spotting in any language. Leveraging open speech corpora in nine languages, we automate the extraction of a large multilingual keyword bank and use it to train an embedding model. With just five training examples per keyword, we fine-tune the embedding model for keyword spotting and achieve an average F1 score of 0.75 on 180 new keywords, unseen by the embedding model, across these nine languages. The embedding model also generalizes to new languages: our 5-shot models achieve an average F1 score of 0.65 on 260 keywords sampled across 13 languages unseen during embedding training. Finally, we investigate streaming accuracy for our 5-shot models in two contexts, keyword spotting and keyword search: across 440 keywords in 22 languages, we achieve an average streaming keyword-spotting accuracy of 87.4% with a false acceptance rate of 4.3%, and observe promising initial results on keyword search.
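To make the few-shot setup concrete, the following is a minimal sketch of adapting a frozen embedding model to new keywords from five examples each. It is not the paper's pipeline: `embed` is a hypothetical stand-in for the pretrained multilingual embedding network (here just a fixed random projection of acoustic features), and nearest-prototype classification by cosine similarity is used as a simple illustration in place of the paper's fine-tuning step.

```python
import numpy as np

# Fixed random projection standing in for a pretrained speech-embedding
# model (assumption; the real model is a trained neural network).
_rng = np.random.default_rng(0)
_W = _rng.standard_normal((40, 16)) / np.sqrt(40)

def embed(features):
    """Map (..., 40)-dim acoustic features to (..., 16)-dim embeddings."""
    return features @ _W

def build_prototypes(support, labels):
    """Average the five support embeddings per keyword into one prototype."""
    protos = {}
    for kw in set(labels):
        idx = [i for i, lbl in enumerate(labels) if lbl == kw]
        protos[kw] = embed(support[idx]).mean(axis=0)
    return protos

def classify(query, protos):
    """Return the keyword whose prototype has the highest cosine similarity."""
    q = embed(query)
    best, best_sim = None, -np.inf
    for kw, p in protos.items():
        sim = q @ p / (np.linalg.norm(q) * np.linalg.norm(p) + 1e-9)
        if sim > best_sim:
            best, best_sim = kw, sim
    return best
```

In this sketch, adapting to a new keyword in any language requires only five labeled utterances to form its prototype; the embedding model itself stays fixed.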
