Paper information - Stochastic Gradient Descent Algorithm in the Computational Network Toolkit

We introduce the stochastic gradient descent algorithm used in the computational network toolkit (CNTK), a general-purpose machine learning toolkit written in C++ for training and using models that can be expressed as a computational network. We describe the algorithm used to compute the gradients automatically for a given network. We also propose a low-cost automatic learning rate selection algorithm and demonstrate that it works well in practice.

1 Computational Network Toolkit

A computational network (CN) is a directed graph in which each leaf represents an input value or a learnable parameter and each non-leaf node represents an operator. Figure 1 illustrates an example CN of a log-linear model. Here, each node is identified by a {node name : operator type} pair and takes its ordered children as the operator’s inputs. For example, in the figure, T = Times(W,X), which is different from T = Times(X,W).

A CN can have many root nodes, which are used under different conditions. For example, one root node may represent a cross-entropy training criterion and another may represent an evaluation criterion. The network in Figure 1 has only one root node, {C: Cross Entropy}. Many machine learning models that can be described as a series of operations, such as neural networks, can be converted into a CN.

The computational network toolkit (CNTK) is a general-purpose, C++-based machine learning toolkit for models that can be described as CNs. Figure 2 illustrates the architecture of CNTK. The core of CNTK is an internal representation of a CN, which provides two key methods: Evaluate, which computes the value of a node given its inputs, and Compute Gradient, which computes the gradient of a node with respect to its inputs. These methods are executed using an IExecutionEngine such as a CPU, a GPU, or a data flow graph such as pTask [6]. ICNBuilder reads the network description (or language) and creates a CN. IDataReader reads in features and labels stored in different formats.
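The description above (leaves as inputs or parameters, operator nodes with ordered children, a forward evaluation and a gradient computation) can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration and not CNTK's actual C++ API: the `Node` class, the `evaluate`, `compute_gradients`, and `sgd_step` functions, and the node names `X`, `L`, `W`, `B`, `T`, `P`, `C` are hypothetical, since Figure 1 is not reproduced here. The sketch fuses softmax and cross entropy into a single hypothetical `CrossEntropyWithSoftmax` operator to keep the gradient rule short, and it uses a plain SGD update with a fixed learning rate of 0.1 rather than the automatic learning rate selection the paper proposes.

```python
import math

class Node:
    """One CN node. Leaves carry a value; operator nodes read their ordered children."""
    def __init__(self, name, op=None, children=(), value=None):
        self.name = name
        self.op = op                    # None marks a leaf (input or learnable parameter)
        self.children = list(children)  # order matters: Times(W, X) != Times(X, W)
        self.value = value
        self.grad = None                # d(root)/d(node), filled in by compute_gradients

def add(a, b):
    """Elementwise sum of two equally shaped nested lists (or two scalars)."""
    return [add(x, y) for x, y in zip(a, b)] if isinstance(a, list) else a + b

def scale(a, k):
    """Multiply every element of a nested list (or a scalar) by k."""
    return [scale(x, k) for x in a] if isinstance(a, list) else a * k

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

# Forward rules: value of a node given the values of its children.
FORWARD = {
    "Times": lambda W, x: [sum(w * xi for w, xi in zip(row, x)) for row in W],
    "Plus": lambda a, b: add(a, b),
    "CrossEntropyWithSoftmax": lambda y, z: -sum(
        yi * math.log(pi) for yi, pi in zip(y, softmax(z))),
}

# Backward rules: gradient for each child, given the gradient g arriving at the node.
BACKWARD = {
    "Times": lambda g, W, x: (
        [[gi * xj for xj in x] for gi in g],                                  # dW = g x^T
        [sum(W[i][j] * g[i] for i in range(len(g))) for j in range(len(x))],  # dx = W^T g
    ),
    "Plus": lambda g, a, b: (g, g),
    "CrossEntropyWithSoftmax": lambda g, y, z: (
        [-g * math.log(pi) for pi in softmax(z)],          # with respect to the label
        [g * (pi - yi) for pi, yi in zip(softmax(z), y)],  # with respect to z; assumes sum(y) == 1
    ),
}

def topo_order(root):
    """Return the nodes reachable from root, children before parents."""
    seen, order = set(), []
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for c in n.children:
                visit(c)
            order.append(n)
    visit(root)
    return order

def evaluate(root):
    """Forward pass: compute every operator node's value from its children."""
    for n in topo_order(root):
        if n.op is not None:
            n.value = FORWARD[n.op](*[c.value for c in n.children])
    return root.value

def compute_gradients(root):
    """Backward pass: propagate d(root)/d(node) from the root down to the leaves."""
    order = topo_order(root)
    for n in order:
        n.grad = None
    root.grad = 1.0
    for n in reversed(order):  # parents before children
        if n.op is None:
            continue
        grads = BACKWARD[n.op](n.grad, *[c.value for c in n.children])
        for c, g in zip(n.children, grads):
            c.grad = g if c.grad is None else add(c.grad, g)  # sum over all parents

def sgd_step(params, lr):
    """Plain SGD update on leaf parameters: value <- value - lr * grad."""
    for p in params:
        p.value = add(p.value, scale(p.grad, -lr))

# A log-linear model with two classes and two features (toy values).
X = Node("X", value=[1.0, 2.0])                  # input features
L = Node("L", value=[1.0, 0.0])                  # one-hot label
W = Node("W", value=[[0.0, 0.0], [0.0, 0.0]])    # learnable weights
B = Node("B", value=[0.0, 0.0])                  # learnable bias
T = Node("T", "Times", [W, X])
P = Node("P", "Plus", [T, B])
C = Node("C", "CrossEntropyWithSoftmax", [L, P])  # the single root node

loss_before = evaluate(C)
compute_gradients(C)
sgd_step([W, B], lr=0.1)
loss_after = evaluate(C)
```

Walking the nodes in topological order means each node's value is ready before any parent reads it, and walking the same order in reverse means each node has received the contributions of all its parents before it passes gradients to its children. With these toy values, one update step lowers the training criterion.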

Dong Yu | Oleksii Kuchaiev | Brian Guenter | Michael L. Seltzer | Adam Eversole

[1] Warren B. Powell, et al. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, 2006, Machine Learning.

[2] Tom Schaul, et al. No more pesky learning rates, 2012, ICML.

[3] Nitish Srivastava, et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, ArXiv.

[4] Brian Guenter, et al. Efficient symbolic differentiation for graphics applications, 2007, SIGGRAPH 2007.

[5] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.

[6] Mark Silberstein, et al. PTask: operating system abstractions to manage GPUs as compute devices, 2011, SOSP.