Improving SGD convergence by tracing multiple promising directions and estimating distance to minimum

Deep neural networks are usually trained with stochastic gradient descent (SGD), which optimizes parameters θ ∈ ℝ^D to minimize an objective function using very rough gradient estimates that only average to the true gradient. Standard approaches such as momentum or ADAM consider only a single direction and do not try to model the distance to an extremum, neglecting valuable information contained in the calculated gradients. Second-order methods can exploit this information, but they are costly and require inverting the Hessian, which is problematic especially in the stochastic setting. The proposed general framework aims to overcome these difficulties by directly evolving a local second-order parametrization in d ≪ D directions, ∑_{i=1}^d λ_i (θ · v_i − p_i)², which models the local information we are interested in and is relatively simple to update for better agreement with the calculated gradients. It allows θ to be updated by simultaneously attracting it toward the modelled directional minima (λ_i > 0) and repulsing it from maxima (λ_i < 0), proportionally to the distances from p_i (and their uncertainty), which also allows problematic saddles to be handled. The calculated gradients can be used to slowly evolve this parametrization to improve agreement with the local behavior of the objective function, accumulating their statistical trends: 1) update the λ, p parameters for a more accurate description of the parabola in the corresponding directions (including uncertainty), 2) rotate the considered subspace toward recently statistically significant directions (replacing those appearing less frequently), and 3) rotate (v_i) inside the subspace to improve the diagonal form of the Hessian in this basis. The presented general framework leaves many options for customization and optimization to specific tasks.
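
To make the attract/repulse update along the traced directions concrete, below is a minimal NumPy sketch of one possible realization, assuming the d modelled directions v_i are kept orthonormal as rows of a matrix V. The function name, the sign(λ_i) rule, and the scalar damping factor are illustrative assumptions, not a prescription of the framework.

```python
import numpy as np

def directional_newton_step(theta, V, lam, p, step=0.5):
    """Hedged sketch of the theta update described above.

    theta : (D,)   current parameters
    V     : (d, D) rows v_i spanning the modelled subspace (assumed orthonormal)
    lam   : (d,)   curvature estimates lambda_i of the directional parabolas
    p     : (d,)   modelled positions of the directional extrema
    step  : damping factor < 1, standing in for the uncertainty weighting
            mentioned in the abstract (an illustrative simplification)
    """
    dist = theta @ V.T - p          # signed distances theta . v_i - p_i
    # Per-direction move: attract toward minima (lambda_i > 0),
    # repulse from maxima (lambda_i < 0) -- the saddle-free Newton rule.
    move = -np.sign(lam) * dist
    return theta + step * (move @ V)

# Tiny usage example on a synthetic case with one minimum and one maximum direction:
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
V = np.eye(5)[:2]                    # two modelled directions
lam, p = np.array([1.0, -1.0]), np.zeros(2)
theta = directional_newton_step(theta, V, lam, p)
```

The sign(λ_i) factor reproduces the attract/repulse behavior described in the abstract (a saddle-free variant of a Newton step in the modelled subspace); the damping factor is one place where the distance uncertainty could enter.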