Stochastic Gradient Descent Algorithm in the Computational Network Toolkit

We introduce the stochastic gradient descent algorithm used in the computational network toolkit (CNTK), a general-purpose machine learning toolkit written in C++ for training and using models that can be expressed as a computational network. We describe the algorithm used to compute the gradients automatically for a given network. We also propose a low-cost automatic learning rate selection algorithm and demonstrate that it works well in practice.

1 Computational Network Toolkit

A computational network (CN) is a directed graph in which each leaf represents an input value or a learnable parameter and each non-leaf node represents an operator. Figure 1 illustrates an example CN of a log-linear model. Here, each node is identified by a {node name : operator type} pair and takes its ordered children as the operator’s inputs. The order matters: in the figure, T = Times(W,X), which is different from T = Times(X,W).

A CN can have many root nodes, which are used under different conditions. For example, one root node may represent a cross-entropy training criterion and another may represent an evaluation criterion. The network in Figure 1 has only one root node, {C: Cross Entropy}. Many machine learning models that can be described as a series of operations, such as neural networks, can be converted into a CN.

The computational network toolkit (CNTK) is a general-purpose, C++-based machine learning toolkit for models that can be described as CNs. Figure 2 illustrates the architecture of CNTK. The core of CNTK is an internal representation of a CN that provides two key methods: Evaluate, which computes the value of a node given its inputs, and Compute Gradient, which computes the gradient of a node with respect to its inputs. These methods are executed using an IExecutionEngine such as a CPU, a GPU, or a data flow graph such as pTask [1]. ICNBuilder reads the network description (or language) and creates a CN. IDataReader reads in features and labels stored in different formats.
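
To make the two key methods concrete, the following is a minimal Python sketch of a single Times node with ordered children. It is not CNTK's C++ implementation: the class names (Node, Leaf, Times), the method names (evaluate, compute_gradient), and the helper functions are hypothetical, chosen only to mirror the description above. The sketch also assumes that compute_gradient receives the gradient of some scalar root with respect to the node's output and returns the gradients with respect to each input.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("shape mismatch")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


class Node:
    """Hypothetical CN node: a name plus an ORDERED list of children."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)  # order defines the operand roles
        self.value = None


class Leaf(Node):
    """An input value or a learnable parameter."""
    def __init__(self, name, value):
        super().__init__(name)
        self.value = value

    def evaluate(self):
        return self.value


class Times(Node):
    """Matrix product of the first child with the second child."""
    def evaluate(self):
        left, right = (c.evaluate() for c in self.children)
        self.value = matmul(left, right)
        return self.value

    def compute_gradient(self, grad_out):
        # For T = L * R: dL = grad_out * R^T and dR = L^T * grad_out.
        left, right = self.children
        return (matmul(grad_out, transpose(right.value)),
                matmul(transpose(left.value), grad_out))


w = Leaf("W", [[1, 2, 3], [4, 5, 6]])   # 2x3 learnable parameter
x = Leaf("X", [[1], [0], [2]])          # 3x1 input
t = Times("T", [w, x])                  # T = Times(W, X)
t.evaluate()
d_w, d_x = t.compute_gradient([[1], [1]])

# Swapping the children gives a different (here ill-defined) operation.
try:
    Times("T2", [x, w]).evaluate()
    order_matters = False
except ValueError:
    order_matters = True
```

With these values, T is the 2x1 product of W and X, while Times(X,W) fails on the shape check, which illustrates why the children of a node must be treated as an ordered list.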