AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

This paper proposes a neural network that applies audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a textual description, while preserving the sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is 'transparent': unlike a pixel in an image, it usually carries information from multiple sources. To address this problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show via objective metrics and empirical verification that AMSS-Net outperforms baselines on these tasks.
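
The extract-manipulate-preserve idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration only, not the authors' architecture: the module names (`LatentSourceManipulator`, `extract`, `select`, `manipulate`, `remix`), the tensor shapes, and the sigmoid gate over latent-source channels are all assumptions made for exposition. The actual AMSS-Net differs in how it extracts latent sources and conditions on the textual query.

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming a spectrogram input and a fixed-size text
# embedding. Everything here is illustrative, not the paper's model.
class LatentSourceManipulator(nn.Module):
    def __init__(self, n_latent_sources=8, channels=32, text_dim=64):
        super().__init__()
        # Split the mixture into several latent-source channel groups.
        self.extract = nn.Conv2d(1, n_latent_sources * channels,
                                 kernel_size=3, padding=1)
        # The text query decides which latent sources to manipulate
        # (a soft per-source selection weight in [0, 1]).
        self.select = nn.Linear(text_dim, n_latent_sources)
        # Transformation applied to the selected latent sources.
        self.manipulate = nn.Conv2d(channels, channels,
                                    kernel_size=3, padding=1)
        # Remix all latent sources back into a single output spectrogram.
        self.remix = nn.Conv2d(n_latent_sources * channels, 1, kernel_size=1)
        self.n, self.c = n_latent_sources, channels

    def forward(self, spec, text_emb):
        # spec: (batch, 1, freq, time); text_emb: (batch, text_dim)
        b, _, f, t = spec.shape
        latents = self.extract(spec).view(b, self.n, self.c, f, t)
        weights = torch.sigmoid(self.select(text_emb)).view(b, self.n, 1, 1, 1)
        # Manipulated version of every latent source...
        changed = self.manipulate(latents.flatten(0, 1))
        changed = changed.view(b, self.n, self.c, f, t)
        # ...blended in only where the text query selects it; unselected
        # latent sources pass through unchanged, preserving them.
        mixed = weights * changed + (1 - weights) * latents
        return self.remix(mixed.view(b, self.n * self.c, f, t))
```

The key design point the sketch tries to convey is the gated blend in the last step: because manipulation is applied per latent source and gated by the text query, sources irrelevant to the description are passed through untouched rather than reconstructed.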
