Overparameterization: A Connection Between Software 1.0 and Software 2.0

A new ecosystem of machine-learning-driven applications, titled Software 2.0, has arisen that integrates neural networks into a variety of computational tasks. Such applications include image recognition, natural language processing, and other traditional machine learning tasks. These techniques have also grown to include more structured domains, such as program analysis and program optimization, in which novel, domain-specific insights are paired with model design. In this paper, we connect the world of Software 2.0 with that of traditional software, Software 1.0, through overparameterization: a program may provide more computational capacity and precision than is necessary for the task at hand. In Software 2.0, overparameterization, in which a machine learning model has more parameters than there are datapoints in its training set, has emerged as a contemporary explanation of the ability of modern, gradient-based learning methods to fit complex datasets with high accuracy. Specifically, the more parameters a model has, the better it learns. In Software 1.0, results from the approximate computing community show that traditional software is also overparameterized, in that software often computes results that are more precise than the user requires. Approximate computing exploits this overparameterization to improve performance by eliminating unnecessary, excess computation. For example, one of many such techniques is to reduce the precision of the arithmetic in an application. In this paper, we argue that the gap between the precision a program provides and the precision its task requires, in both Software 1.0 and Software 2.0, is a fundamental aspect of software design that illustrates the balance between general-purpose software and domain-adapted solutions. A general-purpose solution is easier to develop and maintain than a domain-adapted one, but that ease comes at the expense of performance. We show that the approximate computing community and the machine learning community have developed overlapping techniques to improve performance by reducing overparameterization. We also show that, because of these shared techniques, questions, concerns, and answers about how to construct software can translate from one software variant to the other.
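To make the parallel concrete, the sketch below (illustrative only; the functions, data, and thresholds are our own and are not taken from the paper) pairs the two reductions described above: lowering floating-point precision in a conventional numeric kernel, the Software 1.0 case, and magnitude pruning of an overparameterized weight matrix, the Software 2.0 case.

```python
# Illustrative sketch (assumptions, not the paper's artifact): two faces of
# overparameterization and the corresponding reduction techniques.
import numpy as np

# --- Software 1.0: reduce the precision of arithmetic ------------------------
def mean_distance(points, precision=np.float64):
    """Average pairwise distance, computed at a chosen floating-point precision."""
    pts = points.astype(precision)
    diffs = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).mean()

rng = np.random.default_rng(0)
points = rng.standard_normal((64, 3))
exact = mean_distance(points, np.float64)
approx = mean_distance(points, np.float16)   # cheaper, less precise arithmetic
print(f"float64: {exact:.6f}  float16: {approx:.6f}  "
      f"error: {abs(exact - approx):.2e}")

# --- Software 2.0: magnitude pruning of an overparameterized layer -----------
def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude weights, keeping (1 - sparsity) of them."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W = rng.standard_normal((256, 256))          # stand-in for a trained layer's weights
W_pruned = magnitude_prune(W, sparsity=0.8)
x = rng.standard_normal(256)
rel_change = np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x)
print(f"kept {np.count_nonzero(W_pruned) / W.size:.0%} of the weights; "
      f"relative change in this layer's output: {rel_change:.2f}")
# With random (untrained) weights the change above is large; the point of pruning
# in practice is that trained, overparameterized networks tolerate such sparsity
# with little or no loss in accuracy.
```

In both halves, the program starts with more capacity (precision, parameters) than the task strictly needs, and the optimization removes the excess while monitoring the effect on the result.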
