AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

This paper proposes a neural network that applies audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a textual description, while preserving the sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is 'transparent': unlike a pixel in an image, it usually carries information from multiple sources. To address this problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show via objective metrics and empirical verification that AMSS-Net outperforms baselines on these tasks.
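
The extract-manipulate-preserve idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration only, not the authors' architecture: the module names (`LatentSourceManipulator`, `extract`, `select`, `manipulate`, `remix`), the tensor shapes, and the sigmoid gate over latent-source channels are all assumptions made for exposition. The actual AMSS-Net differs in how it extracts latent sources and conditions on the textual query.

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming a spectrogram input and a fixed-size text
# embedding. Everything here is illustrative, not the paper's model.
class LatentSourceManipulator(nn.Module):
    def __init__(self, n_latent_sources=8, channels=32, text_dim=64):
        super().__init__()
        # Split the mixture into several latent-source channel groups.
        self.extract = nn.Conv2d(1, n_latent_sources * channels,
                                 kernel_size=3, padding=1)
        # The text query decides which latent sources to manipulate
        # (a soft per-source selection weight in [0, 1]).
        self.select = nn.Linear(text_dim, n_latent_sources)
        # Transformation applied to the selected latent sources.
        self.manipulate = nn.Conv2d(channels, channels,
                                    kernel_size=3, padding=1)
        # Remix all latent sources back into a single output spectrogram.
        self.remix = nn.Conv2d(n_latent_sources * channels, 1, kernel_size=1)
        self.n, self.c = n_latent_sources, channels

    def forward(self, spec, text_emb):
        # spec: (batch, 1, freq, time); text_emb: (batch, text_dim)
        b, _, f, t = spec.shape
        latents = self.extract(spec).view(b, self.n, self.c, f, t)
        weights = torch.sigmoid(self.select(text_emb)).view(b, self.n, 1, 1, 1)
        # Manipulated version of every latent source...
        changed = self.manipulate(latents.flatten(0, 1))
        changed = changed.view(b, self.n, self.c, f, t)
        # ...blended in only where the text query selects it; unselected
        # latent sources pass through unchanged, preserving them.
        mixed = weights * changed + (1 - weights) * latents
        return self.remix(mixed.view(b, self.n * self.c, f, t))
```

The key design point the sketch tries to convey is the gated blend in the last step: because manipulation is applied per latent source and gated by the text query, sources irrelevant to the description are passed through untouched rather than reconstructed.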
