SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

Prompt tuning is a technique that tunes a small set of parameters to steer a frozen pre-trained language model (LM) to directly generate the output for downstream tasks. Recently, prompt tuning has demonstrated its storage and computational efficiency in both natural language processing (NLP) and speech processing. These advantages have also positioned prompt tuning as a candidate approach for serving a pre-trained LM across multiple tasks in a unified manner. In speech processing, SpeechPrompt shows high parameter efficiency and competitive performance on a few speech classification tasks. However, whether SpeechPrompt can scale to a large number of tasks remains unanswered. In this work, we propose SpeechPrompt v2, a prompt tuning framework capable of performing a wide variety of speech classification tasks, covering multiple languages and prosody-related tasks. Experimental results show that SpeechPrompt v2 achieves performance on par with prior works with fewer than 0.15M trainable parameters in a unified framework.
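The core mechanism described above, training only a small set of prompt vectors prepended to the input of a frozen LM, can be sketched as follows. This is a minimal illustration in PyTorch under stated assumptions: `TinyFrozenLM`, `PromptTuner`, and all sizes here are hypothetical stand-ins, not the actual GSLM backbone or prompt configuration used in SpeechPrompt v2.

```python
import torch
import torch.nn as nn

class TinyFrozenLM(nn.Module):
    """Hypothetical stand-in for a pre-trained LM over discrete speech units."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, inputs_embeds):
        # Accepts embeddings directly so prompts can be prepended in latent space.
        return self.head(self.encoder(inputs_embeds))

class PromptTuner(nn.Module):
    """Prepends trainable prompt embeddings to the input of a frozen LM."""
    def __init__(self, lm, prompt_len=5):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad = False          # the LM itself stays frozen
        dim = lm.embed.embedding_dim
        # The prompt vectors are the only trainable parameters.
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, token_ids):
        tok = self.lm.embed(token_ids)                            # (B, T, D)
        prompt = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return self.lm(torch.cat([prompt, tok], dim=1))           # (B, P+T, V)

lm = TinyFrozenLM()
tuner = PromptTuner(lm, prompt_len=5)
trainable = sum(p.numel() for p in tuner.parameters() if p.requires_grad)
# In this toy setup only 5 * 32 = 160 parameters are trained, while the
# frozen LM contributes none; this is the parameter efficiency the
# abstract refers to (real prompts in the paper total under 0.15M).
```

Because gradients flow only into `self.prompt`, a single frozen LM copy can serve many tasks, each with its own tiny prompt, which is the storage advantage motivating a unified multi-task framework.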
