Analyzing Upper Bounds on Mean Absolute Errors for Deep Neural Network-Based Vector-to-Vector Regression

In this paper, we show that, in vector-to-vector regression using deep neural networks (DNNs), a generalized mean absolute error (MAE) loss between the predicted and expected feature vectors is upper bounded by the sum of an approximation error, an estimation error, and an optimization error. Leveraging error decomposition techniques from statistical learning theory and non-convex optimization theory, we derive upper bounds for each of the three errors and impose the necessary constraints on the DNN models. Moreover, we assess our theoretical results through a set of image denoising and speech enhancement experiments. The experimental results corroborate our proposed upper bounds on MAE for DNN-based vector-to-vector regression, and the bounds remain valid both with and without the “over-parametrization” technique.
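As a rough illustration of the error decomposition behind this bound (the notation below is our own hedged sketch, not the paper's exact statement), let \mathcal{L}(f) = \mathbb{E}\,\lVert f(\mathbf{x}) - \mathbf{y} \rVert_{1} denote the expected MAE of a hypothesis f in a DNN class \mathcal{F}, \widehat{\mathcal{L}} its empirical counterpart over n training pairs, f^{*} = \arg\min_{f \in \mathcal{F}} \mathcal{L}(f), \hat{f} = \arg\min_{f \in \mathcal{F}} \widehat{\mathcal{L}}(f), and f_{\hat{w}} the model actually returned by training. A standard argument then yields

    \mathcal{L}(f_{\hat{w}}) \;\le\; \underbrace{\mathcal{L}(f^{*})}_{\text{approximation error}} \;+\; \underbrace{2 \sup_{f \in \mathcal{F}} \bigl| \mathcal{L}(f) - \widehat{\mathcal{L}}(f) \bigr|}_{\text{estimation error}} \;+\; \underbrace{\widehat{\mathcal{L}}(f_{\hat{w}}) - \widehat{\mathcal{L}}(\hat{f})}_{\text{optimization error}},

where the three terms are controlled, respectively, by the expressive power of the DNN architecture, a complexity measure of \mathcal{F} (e.g., its Rademacher complexity), and the convergence behavior of the training algorithm.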
