## Description
Presently, only naive SGD has been implemented. However, momentum is an important extension of SGD: informally, it says that the velocity of updates in a particular direction tends to persist, based on recent history.
The parameter update at each time step is then a weighted average of the updates at past time steps, with the weights decaying exponentially. The larger the momentum parameter, the more the update at each step is driven by the parameter's accumulated momentum rather than by its current gradient.
## Implementation
The momentum parameter should be defined as `mu`, and the gradient at step t as `nabla_t`. The update rule is then `nabla_t + mu * nabla_{t-1} + mu^2 * nabla_{t-2} + ...`.
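Equivalently, the accumulated update can be written as a recurrence, with `f(t)` denoting the running total (this is just the rule above, regrouped, and is the form the table below evaluates):

```latex
f(t) = \nabla_t + \mu\, f(t-1), \qquad f(0) = 0
\quad\Longrightarrow\quad
f(t) = \nabla_t + \mu\,\nabla_{t-1} + \mu^2\,\nabla_{t-2} + \cdots
```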
This can be computed like so:
| Iteration | Quantity |
|---|---|
| t1 | f(t1) = nabla_t1 |
| t2 | f(t2) = nabla_t2 + mu * f(t1) |
| t3 | f(t3) = nabla_t3 + mu * f(t2) |
| ... | ... |
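To make the recurrence concrete, here is a minimal standalone sketch (not the project's optimiser code); `RowMatrixXf` is assumed to match the project's row-major float matrix typedef:

```cpp
#include <Eigen/Dense>
#include <vector>

// Assumed to match the project's typedef for a row-major float matrix.
using RowMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Given the gradients nabla_t1, nabla_t2, ... in order, accumulate
// f(t) = nabla_t + mu * f(t-1) without storing any history.
RowMatrixXf accumulateMomentum(const std::vector<RowMatrixXf>& grads, float mu) {
    RowMatrixXf f = RowMatrixXf::Zero(grads.front().rows(), grads.front().cols());
    for (const RowMatrixXf& nabla : grads) {
        f = nabla + mu * f;  // matches each row of the table above
    }
    return f;
}
```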
## Changes
- Define a new `OptimiserType`: `MomentumSGD`.
- Define a new `Optimiser` parameter: `momentum`.
- The `Optimiser` class will require a new private member, `velocities`, of type `std::vector<RowMatrixXf>`, holding one velocity matrix for each parameter in the neural network (a declaration sketch follows this list).
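A declaration sketch for these changes is below, assuming `OptimiserType` is an enum. Only `OptimiserType`, `MomentumSGD`, `momentum`, `velocities` and `RowMatrixXf` come from the list above; the existing enum values, the constructor shape and members such as `lr` are assumptions.

```cpp
#include <Eigen/Dense>
#include <vector>

// Assumed to match the project's typedef for a row-major float matrix.
using RowMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

enum class OptimiserType {
    SGD,          // assumed existing value
    MomentumSGD,  // new optimiser type
};

class Optimiser {
public:
    Optimiser(float lr, float momentum, OptimiserType type);

private:
    float lr;                             // assumed existing learning-rate member
    float momentum;                       // new momentum parameter
    std::vector<RowMatrixXf> velocities;  // one velocity matrix per network parameter
};
```

The velocity buffers could then be zero-initialised per parameter, for example: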
```cpp
// initialise a parameter counter, called param here,
// then loop over all of the parameters in the neural network
Eigen::Ref<RowMatrixXf> paramMatrix = net->getLayers()[i]->operations[j];
RowMatrixXf velNew(paramMatrix.rows(), paramMatrix.cols());
velNew.setZero();
velocities[param++] = velNew;
```
During the update, the momentum parameter will need to be applied as follows:

```cpp
// First update the velocity by the momentum
velocities[param] *= momentum;
// then add the learning-rate-scaled gradient to the velocity
velocities[param] += lr * ...paramGrad.array();
// and subtract the velocity from the parameter
...param.array() -= velocities[param];
param++;
```
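Putting the pieces together, a full update step might look roughly like the sketch below. `net`, `momentum`, `lr` and `velocities` are as described above; the method name, the loop structure and the `operationGrads` gradient accessor are assumptions, since the elided expressions in the snippet above are not spelled out in this issue.

```cpp
// Sketch only: operationGrads is a hypothetical gradient accessor; the real
// lookup depends on the network API.
void Optimiser::step() {
    std::size_t param = 0;
    for (auto& layer : net->getLayers()) {
        for (std::size_t j = 0; j < layer->operations.size(); ++j) {
            Eigen::Ref<RowMatrixXf> paramMatrix = layer->operations[j];    // parameter matrix
            Eigen::Ref<RowMatrixXf> paramGrad = layer->operationGrads[j];  // assumed accessor
            // v <- momentum * v + lr * grad
            velocities[param] *= momentum;
            velocities[param] += lr * paramGrad;
            // theta <- theta - v
            paramMatrix -= velocities[param];
            ++param;
        }
    }
}
```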