Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer

In this paper, we seek to reduce the computational complexity of transformer-based models for speech representation learning. We evaluate 10 attention mechanisms; we then pre-train the transformer-based model with each of these attention mechanisms in a self-supervised fashion and use the resulting models as feature extractors on downstream tasks, including phoneme classification and speaker classification. We find that the proposed approach, which uses only hand-crafted and learnable attentions, is comparable to full self-attention.
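To make the idea of hand-crafted attention concrete, below is a minimal sketch of one possible variant: a fixed, position-based (Gaussian-windowed) attention pattern that replaces the learned query-key dot product, leaving only the value projection learnable. The module names and the window width `sigma` are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: hand-crafted attention whose weights depend only on relative position,
# replacing the learned softmax(QK^T) of standard self-attention.
# The Gaussian window and `sigma` are illustrative assumptions.
import torch
import torch.nn as nn


def gaussian_attention(seq_len: int, sigma: float = 3.0) -> torch.Tensor:
    """Fixed (T, T) attention weights from a Gaussian over relative positions."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    dist = positions.unsqueeze(0) - positions.unsqueeze(1)   # (T, T) relative offsets
    scores = -(dist ** 2) / (2.0 * sigma ** 2)               # Gaussian log-weights
    return torch.softmax(scores, dim=-1)                     # each row sums to 1


class HandCraftedAttention(nn.Module):
    """Attention layer with no query/key projections: the weight matrix is
    hand-crafted, and only the value projection is learned."""

    def __init__(self, d_model: int, sigma: float = 3.0):
        super().__init__()
        self.value = nn.Linear(d_model, d_model)
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, T, d_model)
        attn = gaussian_attention(x.size(1), self.sigma).to(x.device)
        return attn @ self.value(x)                          # (batch, T, d_model)


if __name__ == "__main__":
    frames = torch.randn(2, 100, 768)                        # dummy acoustic features
    out = HandCraftedAttention(d_model=768)(frames)
    print(out.shape)                                         # torch.Size([2, 100, 768])
```

Because the attention pattern is fixed, this layer skips the query/key projections and the quadratic score computation of standard self-attention, which is where the computational savings come from.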
