`returnn.tf.updater`¶

This module covers the optimizer (SGD, Adam, etc.) logic, and model param update logic in general.

returnn.tf.updater.register_optimizer_class(cls, name=None)[source]¶

Parameters:

cls (type[Optimizer|KerasOptimizer])
name (str|None)

returnn.tf.updater.get_optimizer_class(class_name)[source]¶

Parameters: class_name (str|function|type[Optimizer|KerasOptimizer]) – e.g. “adam”
Returns: the class
Return type: type[Optimizer|KerasOptimizer]

class returnn.tf.updater.Updater(config, network, initial_learning_rate=1.0)[source]¶

This will create the tf.compat.v1.train.Optimizer instance given the config and the update-op for all trainable vars. See the code of Updater.create_optimizer() for valid config options.

Wraps one or multiple tf.compat.v1.train.Optimizer, and extends it by some further functionality.

Note: Vincent Vanhoucke says, in case you get NaNs, consider increasing the epsilon (for Adam, Nadam and similar). This is the config option optimizer_epsilon. In some places in our Theano code, the default epsilon is 1e-16; in others, it is 1e-8. 1e-8 might be more stable, or even 1e-6. Note that when the gradient is suddenly zero in one step, the update can be proportional to lr / eps.

From the tf.compat.v1.train.AdamOptimizer documentation:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.
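To make the role of epsilon more concrete, the following pure-Python sketch computes the size of the very first Adam update for a scalar gradient, using the “epsilon hat” formulation quoted above. The helper name `first_adam_step` is made up for this illustration and is not part of RETURNN or TensorFlow.

```python
import math


def first_adam_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1) for a scalar gradient g.

    Hypothetical helper for illustration; uses the "epsilon hat" formulation.
    """
    lr_t = lr * math.sqrt(1.0 - beta2) / (1.0 - beta1)
    m = (1.0 - beta1) * g      # m_1, starting from m_0 = 0
    v = (1.0 - beta2) * g * g  # v_1, starting from v_0 = 0
    return lr_t * m / (math.sqrt(v) + eps)


tiny_grad = 1e-12
step_small_eps = first_adam_step(tiny_grad, eps=1e-16)
step_large_eps = first_adam_step(tiny_grad, eps=1e-8)
print(step_small_eps, step_large_eps)
```

With eps=1e-16 the step is almost the full learning rate even though the gradient is only 1e-12, whereas with eps=1e-8 the same gradient produces a step that is several orders of magnitude smaller.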

More from Vincent Vanhoucke:

One thing you can do is run with a tiny learning rate, or even zero learning rate. If you still have divergence then, you have a bug in your setup. If not, increase your rate slowly and see if there is a regime in which things train without diverging. It’s completely possible to have weights that are in a good range, but activations or gradients going to infinity because of the shape of the loss, or too high a learning rate. It’s obviously always a possibility that there is a bug in the optimizers, but in my experience, every single instance of this kind of problem could be traced back to a weirdly wired model, learning rate issues, bad randomization of the input examples, or - in the case of Adam or RMSProp - issues with the epsilon value.

In addition, you might also want to try gradient_nan_inf_filter or maybe set beta1=0.5.

For further debugging, see tf.add_check_numerics_ops() or add_check_numerics_ops_and_debug_print(), which is config option debug_add_check_numerics_ops. Also relevant are config options debug_add_check_numerics_on_output and debug_grad_summaries.
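As a rough sketch, the options mentioned above might be combined in a RETURNN config file (which is plain Python) as shown below. All values are illustrative only. The `optimizer` dict with a `beta1` entry and the `learning_rate` option are assumptions about how these settings are passed; see the code of Updater.create_optimizer() for the options that are actually supported.

```python
# Illustrative config fragment; values are examples, not recommendations.
optimizer = {"class": "adam", "beta1": 0.5}  # assumption: extra optimizer args go into this dict
optimizer_epsilon = 1e-8  # a larger epsilon, e.g. 1e-6, might be more stable
gradient_nan_inf_filter = True  # option named in the text above
learning_rate = 1e-5  # assumption: start tiny, then increase slowly

# Debugging aids named in the text above.
debug_add_check_numerics_ops = True
debug_add_check_numerics_on_output = True
debug_grad_summaries = True
```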

Parameters:

config (returnn.config.Config)
network (TFNetwork)
initial_learning_rate (float)

reset_optim_op()[source]¶: Call this if something changes that the optim_op depends on. See self.create_optim_op().

set_trainable_vars(trainable_vars)[source]¶

Parameters: trainable_vars (list[tf.Variable])

set_learning_rate(value, session)[source]¶

Parameters:

value (float)
session (tf.compat.v1.Session)

get_current_step_learning_rate()[source]¶

Return type: tf.Tensor

create_optim_op()[source]¶

Creates the optimize TF op.

Returns: nothing, will just set self.optim_op

get_optim_op(callback_on_new=None)[source]¶

Parameters: callback_on_new (None|()->None)
Return type: tf.Operation

init_optimizer_vars(session)[source]¶

Parameters: session (tf.compat.v1.Session)

get_default_optimizer()[source]¶

Return type: tf.compat.v1.train.Optimizer

get_default_optimizer_item(auto_create_new)[source]¶

Parameters: auto_create_new (bool)
Returns: key, optimizer
Return type: (object, tf.compat.v1.train.Optimizer)

create_all_needed_optimizers(train_vars)[source]¶

Parameters: train_vars (list[tf.Variable])

get_slot_names_per_optimizer()[source]¶

Returns: ordered dict: opt key -> slot names
Return type: dict[object, list[str]]

filter_var_list_per_optimizer_key(var_list, opt_key)[source]¶

Parameters:

var_list (list[tf.Variable])
opt_key (object) – should be in self.optimizer

Return type:

list[tf.Variable]

get_slot(var, name)[source]¶

Parameters:

var (tf.Variable)
name (str)

Return type:

tf.Variable|None

get_apply_grads_op(loss, var_list)[source]¶

Parameters:

loss (tf.Tensor)
var_list (list[tf.Variable])

Returns:

op with all variable updates combined, using the optimizer

Return type:

tf.Operation

returnn.tf.updater.accum_grad_multiple_step(grad, var, train_step, num_accum_steps)[source]¶

Parameters:

grad (tf.Tensor|tf.IndexedSlices)
var (tf.Variable)
train_step (tf.Tensor) – int, scalar
num_accum_steps (int)

Returns:

modified grad

Return type:

tf.Tensor
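The idea behind accumulating gradients over multiple steps can be sketched without TensorFlow. The class below is a hypothetical pure-Python stand-in (`GradAccumulator` is not a RETURNN name). It assumes that the accumulator restarts whenever `train_step % num_accum_steps == 0` and otherwise adds the new gradient; the actual TF implementation may differ in detail.

```python
class GradAccumulator:
    """Hypothetical scalar stand-in for an accumulated-gradient variable."""

    def __init__(self, num_accum_steps):
        self.num_accum_steps = num_accum_steps
        self.accum = 0.0

    def __call__(self, grad, train_step):
        if train_step % self.num_accum_steps == 0:
            self.accum = grad  # first step of a window: restart from this gradient
        else:
            self.accum += grad  # later steps: add to the running sum
        return self.accum


accumulate = GradAccumulator(num_accum_steps=2)
modified_grads = [accumulate(g, step) for step, g in enumerate([1.0, 2.0, 3.0, 4.0])]
print(modified_grads)  # [1.0, 3.0, 3.0, 7.0]
```

Under this assumption, the complete sum of a window is only available at its last step (here steps 1 and 3).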

class returnn.tf.updater.BaseCustomOptimizer(learning_rate, use_locking=False, name=None)[source]¶

Base class for our own optimizer implementations. Compared to Optimizer, this somewhat simplifies the interface that has to be implemented. You just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer for an example.

Construct a new optimizer.

Args:

learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.

class returnn.tf.updater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)[source]¶

Just an example implementation for simple gradient descent.

Construct a new optimizer.

Args:

learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.

class returnn.tf.updater.NormalizedSGD(learning_rate, use_locking=False, name=None)[source]¶

All grads are L2 normalized (via tf.nn.l2_normalize()), otherwise it’s standard SGD. Via: https://github.com/kmkolasinski/deep-learning-notes/tree/master/max-normed-optimizer

Construct a new optimizer.

Args:

learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
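The effect of L2-normalizing the gradient before a plain SGD step can be illustrated in pure Python. This is only a sketch: the function names are made up, and it assumes the norm is taken over the whole gradient of a single variable, mirroring the default behaviour of tf.nn.l2_normalize().

```python
import math


def l2_normalize(grad, eps=1e-12):
    """Scale a gradient vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(max(sum(x * x for x in grad), eps))
    return [x / norm for x in grad]


def normalized_sgd_step(var, grad, learning_rate):
    """One plain SGD step using the L2-normalized gradient."""
    unit_grad = l2_normalize(grad)
    return [v - learning_rate * g for v, g in zip(var, unit_grad)]


new_var = normalized_sgd_step(var=[1.0, 2.0], grad=[30.0, 40.0], learning_rate=0.1)
print(new_var)
```

In this sketch the length of each step equals the learning rate, independent of the magnitude of the raw gradient (unless the gradient is numerically zero).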

class returnn.tf.updater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)[source]¶

Via Neural Optimizer Search with Reinforcement Learning (https://proceedings.mlr.press/v70/bello17a/bello17a.pdf).

Equivalent to the optimizer g * exp(sign(g) * sign(m)), we use:

g * where(sign(g) == sign(m), 1.0, decrease_factor)

where m is the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g

Same beta1 default as in Adam and in the paper: beta1=0.9

Parameters:

beta1 (float) – used for the running average of m
decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
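The scaling rule above can be sketched for a scalar gradient as follows. Note that exp(sign(g) * sign(m)) is e when the signs agree and 1/e when they differ, so relative to the agreeing case the disagreeing case is scaled by e^-2, which presumably explains the value from the paper. The function names are made up for this illustration, and the sketch assumes that the sign comparison uses the freshly updated m_t.

```python
def sign(x):
    """Return -1.0, 0.0 or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def neural_optimizer1_direction(g, m_prev, beta1=0.9, decrease_factor=0.1):
    """Return (scaled gradient, new running average) for a scalar gradient g.

    Assumption: the sign comparison uses the freshly updated m_t.
    """
    m = beta1 * m_prev + (1.0 - beta1) * g
    scale = 1.0 if sign(g) == sign(m) else decrease_factor
    return g * scale, m


# Gradient agrees with the running average: passed through unchanged.
d1, m1 = neural_optimizer1_direction(g=1.0, m_prev=0.0)
# Gradient flips sign while the average is still positive: damped.
d2, m2 = neural_optimizer1_direction(g=-0.5, m_prev=m1)
print(d1, d2)
```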

class returnn.tf.updater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]¶

Let m be the running average of g. Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g

Same beta1 default as in Adam and in the paper: beta1=0.9

Let v be the running average of the variance of g, i.e. of (g - m)^2.

Parameters:

beta1 (float) – used for the running average of g (m)
beta2 (float) – used for the running average of variance of g (v)
epsilon (float)
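The two running statistics can be sketched for a scalar gradient as below. The update rule that consumes m and v is not described here, so the sketch stops at the statistics. The function name is made up, and measuring the deviation against the already updated m_t is an assumption.

```python
def update_grad_stats(g, m_prev, v_prev, beta1=0.9, beta2=0.999):
    """Update running mean m and running variance v of a scalar gradient g.

    Assumption: the deviation is measured against the updated mean m_t.
    """
    m = beta1 * m_prev + (1.0 - beta1) * g
    v = beta2 * v_prev + (1.0 - beta2) * (g - m) ** 2
    return m, v


m, v = 0.0, 0.0
for _ in range(500):
    m, v = update_grad_stats(2.0, m, v)  # constant gradient
print(m, v)
```

With a constant gradient, m converges to that gradient, and v only retains a slowly decaying trace of the initial transient.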

class returnn.tf.updater.NadamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]¶

Optimizer that implements the Nadam algorithm. See [Dozat, T., 2015](http://cs229.stanford.edu/proj2015/054_report.pdf).

Copied from: https://github.com/tensorflow/tensorflow/blob/v1.15.5/tensorflow/contrib/opt/python/training/nadam_optimizer.py

We have this here to have this Nadam variant available in TF 2 because the Keras Nadam behaves a bit differently. See https://github.com/rwth-i6/returnn/issues/766 and https://github.com/tensorflow/tensorflow/issues/53204.

We can still use this old code because the underlying kernel still supports the use_nesterov option.

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \quad \text{(Initialize initial 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize initial 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$\text{lr}_t := \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - \text{lr}_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
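The update rule written out above is the plain Adam rule. As a hedged illustration of what the use_nesterov option changes, the following scalar sketch assumes that the Nesterov variant replaces m_t in the numerator by beta1 * m_t + (1 - beta1) * g. The function `adam_like_step` is made up for this example and is not the kernel used by this class.

```python
import math


def adam_like_step(var, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, use_nesterov=False):
    """One scalar update step; returns (var, m, v). t is the 1-based step count.

    Assumption: with use_nesterov=True the numerator m_t is replaced by
    beta1 * m_t + (1 - beta1) * g.
    """
    lr_t = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    numerator = beta1 * m + (1.0 - beta1) * g if use_nesterov else m
    var = var - lr_t * numerator / (math.sqrt(v) + eps)
    return var, m, v


adam_var, _, _ = adam_like_step(1.0, g=0.5, m=0.0, v=0.0, t=1)
nadam_var, _, _ = adam_like_step(1.0, g=0.5, m=0.0, v=0.0, t=1, use_nesterov=True)
print(adam_var, nadam_var)
```

Under this assumption, the Nesterov variant gives the current gradient more weight, so its first step is larger than the plain Adam step.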

Args:

learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

class returnn.tf.updater.CustomAdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]¶

Reimplementation of Adam. See also tf.compat.v1.train.AdamOptimizer.

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```

Parameters:

beta1 (float) – used for the running average of g (m)
beta2 (float) – used for the running average of g*g (v)
epsilon (float)
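The pseudo-code above translates almost line by line into plain Python. The following sketch applies it to a scalar variable; the function `minimize_with_adam` and the toy objective are made up for illustration and do not reflect how this class is implemented on top of TensorFlow.

```python
import math


def minimize_with_adam(grad_fn, x, steps, learning_rate=0.1,
                       beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run the update rule above on a scalar x; grad_fn(x) returns the gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        lr_t = learning_rate * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        x = x - lr_t * m / (math.sqrt(v) + epsilon)
    return x


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = minimize_with_adam(lambda x: 2.0 * (x - 3.0), x=0.0, steps=500)
print(x_final)
```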

class returnn.tf.updater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())[source]¶

https://colab.research.google.com/notebook#fileId=1xXFAuHM2Ae-OmF5M8Cn9ypGCa_HHBgfG&scrollTo=N1-2wPHN1Otn
https://openreview.net/pdf?id=ryQu7f-RZ
https://keras.io/optimizers/
https://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage
https://github.com/taki0112/AMSGrad-Tensorflow

Create a new Optimizer.

This must be called by the constructors of subclasses.

Args:

use_locking: Bool. If True, use locks to prevent concurrent updates to variables.
name: A non-empty string. The name to use for accumulators created for the optimizer.

Raises:

ValueError: If name is malformed.

apply_gradients(gradient_variables)[source]¶

Parameters: gradient_variables (list[(tf.Tensor,tf.Variable)])
Return type: tf.Operation
