Deep Learning for Audio Signal Processing

Given the recent surge of developments in deep learning, this paper provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and the raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, namely audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and open questions regarding deep learning applied to audio signal processing are identified.
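As a concrete illustration of the log-mel input representation mentioned above, the short sketch below computes a log-mel spectrogram from a waveform. It is a minimal example assuming the librosa library is available; the file name, sample rate, FFT size, hop length, and number of mel bands are illustrative choices, not values prescribed by the reviewed work.

```python
# Minimal sketch: computing a log-mel spectrogram as a network input.
# Assumes librosa is installed; "example.wav" and all parameter values
# (sample rate, FFT size, hop length, mel bands) are illustrative only.
import numpy as np
import librosa

# Load audio as a mono waveform, resampled to 16 kHz.
y, sr = librosa.load("example.wav", sr=16000, mono=True)

# Short-time Fourier transform magnitudes pooled onto a mel filterbank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=80, power=2.0
)

# Log compression (dB scale), the usual final step before feeding a CNN or RNN.
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames): an 80-band time-frequency "image"
```

The resulting two-dimensional time-frequency representation is what convolutional and recurrent models typically consume; raw-waveform models skip this step and learn a front-end directly from `y`.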
