## Description
Presently, only naive SGD has been implemented. However, momentum is an important extension of SGD: informally, it says that the velocity of updates in a particular direction tends to persist, based on recent history.
The parameter update at each time step is then a weighted average of the updates at past time steps, with the weights decaying exponentially. The larger the momentum parameter, the more the update at each step is driven by the parameter's accumulated momentum rather than by its current gradient.
## Implementation
The momentum parameter should be defined as `mu`, and the gradient at step t as `nabla_t`. The update rule is then `nabla_t + mu * nabla_{t-1} + mu^2 * nabla_{t-2} + ...`.
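Equivalently, the accumulated update can be written as a recurrence, with `f(t)` denoting the running total (this is just the rule above, regrouped, and is the form the table below evaluates):

```latex
f(t) = \nabla_t + \mu\, f(t-1), \qquad f(0) = 0
\quad\Longrightarrow\quad
f(t) = \nabla_t + \mu\,\nabla_{t-1} + \mu^2\,\nabla_{t-2} + \cdots
```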
This can be computed like so:
| Iteration | Quantity |
|---|---|
| t1 | f(t1) = nabla_t1 |
| t2 | f(t2) = nabla_t2 + mu * f(t1) |
| t3 | f(t3) = nabla_t3 + mu * f(t2) |
| ... | ... |
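To make the recurrence concrete, here is a minimal standalone sketch (not the project's optimiser code); `RowMatrixXf` is assumed to match the project's row-major float matrix typedef:

```cpp
#include <Eigen/Dense>
#include <vector>

// Assumed to match the project's typedef for a row-major float matrix.
using RowMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Given the gradients nabla_t1, nabla_t2, ... in order, accumulate
// f(t) = nabla_t + mu * f(t-1) without storing any history.
RowMatrixXf accumulateMomentum(const std::vector<RowMatrixXf>& grads, float mu) {
    RowMatrixXf f = RowMatrixXf::Zero(grads.front().rows(), grads.front().cols());
    for (const RowMatrixXf& nabla : grads) {
        f = nabla + mu * f;  // matches each row of the table above
    }
    return f;
}
```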
## Changes
- Define a new `OptimiserType`: `MomentumSGD`.
- Define a new `Optimiser` parameter: `momentum`.
- The `Optimiser` class will require a new private member, `velocities`, of type `std::vector<RowMatrixXf>`, holding one velocity matrix for each parameter in the neural network (a declaration sketch follows this list).
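A declaration sketch for these changes is below, assuming `OptimiserType` is an enum. Only `OptimiserType`, `MomentumSGD`, `momentum`, `velocities` and `RowMatrixXf` come from the list above; the existing enum values, the constructor shape and members such as `lr` are assumptions.

```cpp
#include <Eigen/Dense>
#include <vector>

// Assumed to match the project's typedef for a row-major float matrix.
using RowMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

enum class OptimiserType {
    SGD,          // assumed existing value
    MomentumSGD,  // new optimiser type
};

class Optimiser {
public:
    Optimiser(float lr, float momentum, OptimiserType type);

private:
    float lr;                             // assumed existing learning-rate member
    float momentum;                       // new momentum parameter
    std::vector<RowMatrixXf> velocities;  // one velocity matrix per network parameter
};
```

The velocity buffers could then be zero-initialised per parameter, for example: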
```cpp
// initialise a parameter counter, called param here,
// then loop over all of the parameters in the neural network
Eigen::Ref<RowMatrixXf> paramMatrix = net->getLayers()[i]->operations[j];
RowMatrixXf velNew(paramMatrix.rows(), paramMatrix.cols());
velNew.setZero();
velocities[param++] = velNew;
```
During the update, the momentum parameter will need to be applied as follows:

```cpp
// First update the velocity by the momentum
velocities[param] *= momentum;
// then add the learning-rate-scaled gradient to the velocity
velocities[param] += lr * ...paramGrad.array();
// and subtract the velocity from the parameter
...param.array() -= velocities[param];
param++;
```
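Putting the pieces together, a full update step might look roughly like the sketch below. `net`, `momentum`, `lr` and `velocities` are as described above; the method name, the loop structure and the `operationGrads` gradient accessor are assumptions, since the elided expressions in the snippet above are not spelled out in this issue.

```cpp
// Sketch only: operationGrads is a hypothetical gradient accessor; the real
// lookup depends on the network API.
void Optimiser::step() {
    std::size_t param = 0;
    for (auto& layer : net->getLayers()) {
        for (std::size_t j = 0; j < layer->operations.size(); ++j) {
            Eigen::Ref<RowMatrixXf> paramMatrix = layer->operations[j];    // parameter matrix
            Eigen::Ref<RowMatrixXf> paramGrad = layer->operationGrads[j];  // assumed accessor
            // v <- momentum * v + lr * grad
            velocities[param] *= momentum;
            velocities[param] += lr * paramGrad;
            // theta <- theta - v
            paramMatrix -= velocities[param];
            ++param;
        }
    }
}
```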