Masked Autoencoders that Listen
Michael Auli | Florian Metze | Alexei Baevski | Christoph Feichtenhofer | Po-Yao Huang | Juncheng Billy Li | Wojciech Galuba | Hu Xu