CS256
Chris Pollett
Nov 3, 2021
RMSProp with Nesterov momentum:

epsilon := starting learning rate
rho := decay rate
alpha := momentum coefficient
theta := initial weights
v := initial velocity
r := 0 // gradient accumulation variable
while stopping criteria not met:
    Sample minibatch {x[1], ..., x[m]} from training set with targets {y[1], ..., y[m]}.
    Compute interim update: tmpTheta := theta + alpha*v
    Compute gradient: g := 1/m * grad_{tmpTheta} sum_i L(f(x[i]; tmpTheta), y[i])
    Accumulate squared gradient: r := rho*r + (1 - rho)*hadamard(g, g) // pointwise product
    Update velocity: v := alpha*v - hadamard(vec(epsilon/sqrt(r_j)), g) // build the vector whose j-th component is epsilon/sqrt(r_j), then take its Hadamard product with g
    Apply update: theta := theta + v
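The loop above can be sketched in NumPy as follows. This is a minimal illustration, not a production optimizer: the function name `rmsprop_nesterov`, the toy quadratic objective, the hyperparameter values, and the small stability constant `delta` added inside the square root (to avoid dividing by zero before `r` accumulates) are all assumptions not present in the pseudocode.

```python
import numpy as np

def rmsprop_nesterov(grad_fn, theta, epsilon=0.01, rho=0.9, alpha=0.9,
                     delta=1e-8, steps=500):
    """Sketch of RMSProp with Nesterov momentum.

    grad_fn(tmp_theta) stands in for the minibatch gradient
    (1/m) * grad sum_i L(f(x[i]; tmp_theta), y[i]).
    delta is an assumed stability constant, not in the pseudocode.
    """
    v = np.zeros_like(theta)   # initial velocity
    r = np.zeros_like(theta)   # gradient accumulation variable
    for _ in range(steps):
        tmp_theta = theta + alpha * v        # interim (lookahead) update
        g = grad_fn(tmp_theta)               # gradient at the interim point
        r = rho * r + (1.0 - rho) * g * g    # accumulate squared gradient (pointwise)
        # Pointwise-scale g by epsilon/sqrt(r_j), then update the velocity.
        v = alpha * v - (epsilon / np.sqrt(r + delta)) * g
        theta = theta + v                    # apply update
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta_min = rmsprop_nesterov(lambda t: 2.0 * t, np.array([3.0, -2.0]))
```

Because RMSProp divides each coordinate of the gradient by sqrt(r_j), steps have roughly constant per-coordinate magnitude near epsilon, so the iterates jitter around the minimum rather than settling exactly on it.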