We introduce the stochastic gradient descent algorithm used in the computational network toolkit (CNTK), a general purpose machine learning toolkit written in C++ for training and using models that can be expressed as a computational network. We describe the algorithm used to compute the gradients automatically for a given network. We also propose a low-cost automatic learning rate selection algorithm and demonstrate that it works well in practice.

1 Computational Network Toolkit

A computational network (CN) is a directed graph in which each leaf represents an input value or a learnable parameter and each non-leaf node represents an operator. Figure 1 illustrates an example CN of a log-linear model. Here, each node is identified by a {node name : operator type} pair and takes its ordered children as the operator's inputs. For example, in the figure, T = Times(W, X), which is different from T = Times(X, W) because the operands are ordered. A CN can have multiple root nodes, which are used under different conditions. For example, one root node may represent a cross-entropy training criterion and another may represent an evaluation criterion. The network in Figure 1 has only one root node, {C: CrossEntropy}. Many machine learning models that can be described as a series of operations, such as neural networks, can be converted into a CN.

The computational network toolkit (CNTK) is a general purpose C++ machine learning toolkit for models that can be described as CNs. Figure 2 illustrates the architecture of CNTK. The core of CNTK is an internal representation of a CN, which provides two key methods: Evaluate, which computes the value of a node given its inputs, and ComputeGradient, which computes the gradient of a node with respect to its inputs. These methods are executed through an IExecutionEngine, which may run on a CPU, a GPU, or a data-flow engine such as pTask [6]. ICNBuilder reads the network description and creates the CN. IDataReader reads in features and labels stored in different formats.
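To make the Evaluate/ComputeGradient pattern concrete, the following is a minimal C++ sketch of a CN node, not CNTK's actual classes. The type and member names (Node, Parameter, Input, Times, value, gradient) are hypothetical, values are scalars rather than the matrices CNTK operates on, and a real toolkit would evaluate nodes in topological order over the whole graph.

    // Minimal sketch of a computational-network node (hypothetical names).
    #include <cstdio>
    #include <vector>

    struct Node {
        std::vector<Node*> children;   // ordered inputs (operands)
        double value = 0.0;            // filled by Evaluate (forward pass)
        double gradient = 0.0;         // d(criterion)/d(node), filled during backprop
        virtual void Evaluate() {}                 // compute value from children
        virtual void ComputeGradient() {}          // push this node's gradient to children
        virtual ~Node() = default;
    };

    struct Parameter : Node {          // learnable leaf, e.g. W or b
        explicit Parameter(double v) { value = v; }
    };

    struct Input : Node {              // data leaf, e.g. X
        explicit Input(double v) { value = v; }
    };

    struct Times : Node {              // T = Times(W, X); operand order matters
        Times(Node* w, Node* x) { children = {w, x}; }
        void Evaluate() override { value = children[0]->value * children[1]->value; }
        void ComputeGradient() override {
            children[0]->gradient += gradient * children[1]->value;  // dT/dW = X
            children[1]->gradient += gradient * children[0]->value;  // dT/dX = W
        }
    };

    int main() {
        Parameter W(0.5);
        Input     X(2.0);
        Times     T(&W, &X);

        T.Evaluate();                  // forward: T = W * X = 1.0
        T.gradient = 1.0;              // seed the root with dT/dT = 1
        T.ComputeGradient();           // backward: accumulate into W and X

        std::printf("T = %g, dT/dW = %g, dT/dX = %g\n",
                    T.value, W.gradient, X.gradient);
        return 0;
    }

Running the sketch prints T = 1, dT/dW = 2, dT/dX = 0.5, illustrating how each operator only needs local rules for its value and its gradient with respect to each ordered child.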
References

[1] Warren B. Powell et al., "Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming," Machine Learning, 2006.
[2] Tom Schaul et al., "No more pesky learning rates," ICML, 2012.
[3] Nitish Srivastava et al., "Improving neural networks by preventing co-adaptation of feature detectors," arXiv, 2012.
[4] Brian Guenter et al., "Efficient symbolic differentiation for graphics applications," SIGGRAPH, 2007.
[5] Yann LeCun et al., "The MNIST database of handwritten digits," 2005.
[6] Mark Silberstein et al., "PTask: operating system abstractions to manage GPUs as compute devices," SOSP, 2011.