To define the update algorithm, set the parameter optimizer to a dictionary and define the type by setting its "class" entry.
All available optimizers and their parameters can be found here.
The learning rate should not be set inside this dict, but separately via the learning_rate parameter.
If no updater is specified, plain SGD is used.
The learning rate control scheme is set with the parameter learning_rate_control, and many settings are available for the different control schemes. For the default values, have a look at the RETURNN source code.
RETURNN will override the optimizer epsilon with 1e-16 if not specified otherwise; this can lead to unwanted behaviour if you assume a default epsilon of e.g. 1e-8 for Adam.
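As a minimal sketch, an optimizer dict selecting Adam with an explicit epsilon might look like this (the concrete values are illustrative assumptions, not defaults):

```python
# Illustrative config fragment: select Adam and set epsilon explicitly,
# so the 1e-16 override mentioned above does not apply.
optimizer = {"class": "adam", "epsilon": 1e-8}
# The learning rate is set separately, not inside the dict.
learning_rate = 0.001
```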
An integer specifying the number of update steps over which gradients are accumulated before being applied, called "gradient accumulation".
This can be set as either a dictionary or a function.
When setting a dictionary, a cyclic learning rate can be implemented by setting the parameters interval and decay. The global learning rate is then multiplied by decay ** (global_step % interval).
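As a plain-Python sanity check of this formula (the interval and decay values here are made-up examples, not defaults):

```python
interval = 1000  # hypothetical cycle length in steps
decay = 0.99     # hypothetical per-step decay factor
base_lr = 0.001  # the global learning rate

def cyclic_lr(global_step):
    # The learning rate decays within a cycle and restarts
    # at base_lr every `interval` steps.
    return base_lr * decay ** (global_step % interval)

print(cyclic_lr(0))     # cycle start: equals base_lr
print(cyclic_lr(1000))  # after a full cycle: back to base_lr
```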
When using a custom function, the passed parameters are network, global_train_step and learning_rate. Do not forget to mark the parameters as keyword-only arguments and add **kwargs to keep the config compatible with future changes.
An example for Noam-style learning rate scheduling would be:
```python
learning_rate = 1  # can be higher, reasonable values may be up to 10 or even more
learning_rate_control = "constant"


def noam(n, warmup_n, model_d):
    """
    Noam style learning rate scheduling (k is identical to the global learning rate)

    :param int|float|tf.Tensor n:
    :param int|float|tf.Tensor warmup_n:
    :param int|float|tf.Tensor model_d:
    :return:
    """
    from returnn.tf.compat import v1 as tf
    model_d = tf.cast(model_d, tf.float32)
    n = tf.cast(n, tf.float32)
    warmup_n = tf.cast(warmup_n, tf.float32)
    return tf.pow(model_d, -0.5) * tf.minimum(tf.pow(n, -0.5), n * tf.pow(warmup_n, -1.5))


def dynamic_learning_rate(*, network, global_train_step, learning_rate, **kwargs):
    """
    :param TFNetwork network:
    :param tf.Tensor global_train_step:
    :param tf.Tensor learning_rate: current global learning rate
    :param kwargs:
    :return:
    """
    WARMUP_N = 25000
    MODEL_D = 512
    return learning_rate * noam(n=global_train_step, warmup_n=WARMUP_N, model_d=MODEL_D)
```
Specify a gradient clipping threshold.
Apply (presumably Gaussian) noise to the gradient with the given deviation (it is not documented whether this is the variance or the standard deviation).
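In a config, the two gradient-related options above might be set like this (the threshold and deviation values are illustrative assumptions):

```python
gradient_clip = 5.0   # clip gradients at this threshold
gradient_noise = 0.3  # deviation of the noise added to the gradients
```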
Specifies the global learning rate
A list of learning rates that defines the learning rate for each epoch from the beginning. Can be used for learning-rate warmup.
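For example, a linear warm-up over the first five epochs could be sketched like this (the target learning rate and warm-up length are illustrative assumptions):

```python
# Linear warm-up from 1e-5 to 1e-4 over the first 5 epochs;
# once the list is exhausted, the global learning_rate applies.
learning_rate = 1e-4
learning_rates = [1e-5 + (1e-4 - 1e-5) * i / 4 for i in range(5)]
```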
This defines which type of learning rate control mechanism is used. Possible values are:

- constant: a constant learning rate which is never modified
- newbob_abs: a scheduling based on absolute improvement
- newbob_rel: a scheduling based on relative improvement
- newbob_multi_epoch: a scheduling based on relative improvement, averaged over multiple epochs

Please also look at the settings with the newbob prefix for further customization.
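A sketch of a config using the multi-epoch scheme, combining several of the newbob settings described in this section (all values are illustrative assumptions, not recommended defaults):

```python
learning_rate_control = "newbob_multi_epoch"  # relative improvement, averaged over epochs
newbob_multi_num_epochs = 6                   # average the improvement over 6 epochs
newbob_learning_rate_decay = 0.9              # scaling factor applied when the LR is reduced
```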
A str to define which score or error is used to control the learning rate reduction. By default, RETURNN will use dev_score_output. A typical choice would be dev_score_LAYERNAME or dev_error_LAYERNAME. Can be set to None to disable learning rate control.
The number of epochs after the last update that the learning rate is kept constant.
If true, the relative error is scaled with the ratio of the default learning rate divided by the current learning rate. Can be used with the newbob schemes based on relative improvement.
A path to a file storing the learning rate for each epoch. Despite the name, also stores scores and errors.
Specifies the minimum learning rate.
This is the absolute improvement that has to be achieved in order to _not_ reduce the learning rate. Can be used with newbob_abs. The value can be positive or negative.
The scaling factor for the learning rate when a reduction is applied. This parameter is available for all newbob schemes.
The number of epochs the improvement is averaged over.
The number of steps after which the learning rate is updated. This is set equal to newbob_multi_num_epochs when not specified.
This is the relative improvement that has to be achieved in order to _not_ reduce the learning rate. Can be used with newbob_multi_epoch. The value can be positive or negative.
A dictionary with a class entry for the optimizer. Other keys are passed as parameters to the constructor of the optimizer class.
If true, the relative error is computed by dividing the error difference by the old error value instead of the current error value.
The number of epochs after which the internal states of all optimizers will be reset to their initial state.
If true, use the learning rate control scheme also during pre-training.