To define the update algorithm, there are two different methods. One is to set the desired algorithm explicitly, e.g. adam = True or rmsprop = True. The other method is to set the parameter optimizer and define the type by setting class in a dictionary. Several updaters are currently available; if no updater is specified, SGD is used.
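As an illustration, here is a minimal sketch of a RETURNN config excerpt (a config is a plain Python file of global variables) showing both styles; the chosen values are only examples, and the exact set of accepted class names depends on the RETURNN version.

```python
# RETURNN config excerpt (illustrative sketch, not a complete config).

# Method 1: enable the desired algorithm explicitly via a boolean flag.
adam = True          # use Adam; alternatively e.g. rmsprop = True

# Method 2: set the parameter "optimizer" and choose the type via "class".
# Use only one of the two methods in a real config.
# optimizer = {"class": "adam"}
```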
- accum_grad_multiple_step: An integer specifying the number of updates over which to accumulate the gradient before applying it, called “gradient accumulation” (see the config examples after this list).
- adam: Set to True to enable Adam gradient updating.
- gradient_clip: Specify a gradient clipping threshold.
- gradient_noise: Apply Gaussian noise with the given standard deviation to the gradient.
- learning_rate: Specifies the global learning rate.
- learning_rate_control_error_measure: A str to define which score or error is used to control the learning-rate reduction. By default, RETURNN uses dev_score_output. A typical choice would be dev_score_LAYERNAME or dev_error_LAYERNAME. Can be set to None to disable learning-rate control.
- learning_rate_file: A path to a file storing the learning rate for each epoch. Despite the name, it also stores scores and errors.
- learning_rates: A list of learning rates defining the learning rate for each epoch from the beginning. Can be used for learning-rate warmup.
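For illustration, a sketch of the gradient-related parameters from the list above as a config excerpt; the numeric values are arbitrary examples, not recommendations.

```python
# Gradient-related settings (illustrative values only).
accum_grad_multiple_step = 4   # accumulate the gradient over 4 updates ("gradient accumulation")
gradient_clip = 5.0            # clip gradients at this threshold
gradient_noise = 0.1           # add Gaussian noise of this scale to the gradients
```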
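Similarly, a sketch of the learning-rate-related parameters; all values are assumptions for illustration, and "output" is assumed to be the name of the output layer in this config.

```python
# Learning-rate settings (illustrative values only).
learning_rate = 0.001                      # global learning rate
learning_rates = [0.0001, 0.0005, 0.001]   # explicit rates for the first epochs (warmup)
learning_rate_file = "learning_rates"      # per-epoch learning rates, scores and errors
learning_rate_control_error_measure = "dev_score_output"  # measure steering LR reduction
# learning_rate_control_error_measure = None   # disable learning-rate control
```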