Understanding Self-Attention of Self-Supervised Audio Transformers

Self-supervised Audio Transformers (SATs) have enabled great success in many downstream speech applications such as ASR, but how they work has not been widely explored. In this work, we present multiple strategies for analyzing the attention mechanisms in SATs. We categorize attention maps into explainable categories and find that each category possesses its own unique functionality. We provide a visualization tool for understanding multi-head self-attention, importance ranking strategies for identifying critical attention, and attention refinement techniques to improve model performance.
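
As an illustration of how such an attention analysis can be set up, the sketch below buckets the heads of a single self-attention layer into rough "diagonal", "vertical", and "global" patterns according to where their attention mass concentrates. This is a minimal sketch under stated assumptions: the band-mass and column-mass heuristics, the thresholds, and the function name are illustrative choices, not the exact metrics used in this work, and the attention tensor would in practice be extracted from a pretrained SAT model rather than sampled at random.

```python
import torch

def categorize_heads(attn, diag_width=2, diag_thresh=0.5, vert_thresh=0.5):
    """Bucket attention heads into rough 'diagonal' / 'vertical' / 'global' types.

    attn: tensor of shape (num_heads, seq_len, seq_len) whose rows sum to 1.
    The thresholds and mass-based heuristics below are illustrative assumptions.
    """
    num_heads, seq_len, _ = attn.shape
    idx = torch.arange(seq_len)
    # Boolean mask selecting a narrow band around the main diagonal.
    band = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= diag_width
    labels = []
    for h in range(num_heads):
        a = attn[h]
        diag_mass = a[band].sum() / seq_len   # average per-row mass near the diagonal
        col_mass = a.mean(dim=0)              # average attention each position receives
        vert_mass = col_mass.max()            # mass concentrated on a single position
        if diag_mass > diag_thresh:
            labels.append("diagonal")
        elif vert_mass > vert_thresh:
            labels.append("vertical")
        else:
            labels.append("global")
    return labels

# Toy usage with random attention maps; replace with maps taken from a real SAT model.
dummy = torch.softmax(torch.randn(12, 50, 50), dim=-1)
print(categorize_heads(dummy))
```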
