AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite this recent success, current LLMs cannot process complex audio information or conduct spoken conversations (in the manner of Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements an LLM (e.g., ChatGPT) with 1) foundation models that process complex audio information and solve numerous understanding and generation tasks, and 2) an input/output interface (ASR, TTS) that supports spoken dialogue. Given the growing need to evaluate how well multi-modal LLMs understand human intention and orchestrate foundation models, we outline evaluation principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate AudioGPT's ability to solve AI tasks involving speech, music, sound, and talking-head understanding and generation in multi-round dialogues, empowering humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at \url{https://github.com/AIGC-Audio/AudioGPT}.
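The abstract describes a dispatch pipeline: spoken or textual input is analyzed by the LLM, which selects an audio foundation model to execute the task before the result is returned (optionally via TTS). The sketch below illustrates that control flow only; all names are hypothetical placeholders rather than the project's real API, and the keyword-based `analyze_task` stands in for the LLM's task analysis.

```python
# Minimal sketch of an AudioGPT-style dispatch loop. All identifiers are
# illustrative placeholders; a real system would call ASR/TTS models and
# ChatGPT for task analysis instead of keyword rules and stub functions.
from typing import Callable, Dict

# Registry of audio foundation "models", stubbed as plain functions.
TASKS: Dict[str, Callable[[str], str]] = {
    "text-to-speech": lambda text: f"<waveform for: {text}>",
    "speech-recognition": lambda audio: f"<transcript of: {audio}>",
    "text-to-music": lambda text: f"<music clip for: {text}>",
}

def analyze_task(request: str) -> str:
    """Stand-in for LLM task analysis: map a user request to a task name."""
    if "transcribe" in request:
        return "speech-recognition"
    if "music" in request:
        return "text-to-music"
    return "text-to-speech"

def audiogpt_round(request: str, payload: str) -> str:
    """One dialogue round: select a foundation model and run it on the payload."""
    task = analyze_task(request)
    return TASKS[task](payload)

print(audiogpt_round("please transcribe this", "meeting.wav"))
```

The key design point this mirrors is that the LLM never processes audio itself; it only routes requests to specialized models and composes their results across dialogue rounds.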
