Optimizer

This is a list of all optimizers that can be used with RETURNN. If you want to know how to set the optimizer correctly in the RETURNN config, please have a look at the optimizer settings.
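For orientation, here is a minimal sketch of how an optimizer is typically selected in a RETURNN config. The exact option names are an assumption and may differ between RETURNN versions; the optimizer settings documentation is authoritative.

```python
# Sketch of a RETURNN config fragment (assumed option names; see the
# optimizer settings documentation for the authoritative spelling).
learning_rate = 0.001

# The optimizer is selected by class name; any further keys are passed
# to the optimizer's constructor as keyword arguments.
optimizer = {"class": "adam", "epsilon": 1e-8}
```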

Adadelta

class tensorflow.python.training.adadelta.AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')[source]

Optimizer that implements the Adadelta algorithm.

References:
ADADELTA - An Adaptive Learning Rate Method:
[Zeiler, 2012](http://arxiv.org/abs/1212.5701) ([pdf](http://arxiv.org/pdf/1212.5701v1.pdf))

Construct a new Adadelta optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
To match the exact form in the original paper use 1.0.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used
to better condition the grad update.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Adadelta”.

Eager compatibility: when eager execution is enabled, learning_rate, rho, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.
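For reference, a construction sketch outside of RETURNN, using the learning rate suggested above to match the paper; the toy variable and loss are made up for illustration.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF1-style graph mode for this sketch

# Illustrative: a toy variable and loss, just to have something to minimize.
var = tf.compat.v1.get_variable("w", shape=(), initializer=tf.zeros_initializer())
loss = tf.square(var - 3.0)

# learning_rate=1.0 matches the exact form in the original paper (see above).
opt = tf.compat.v1.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-8)
train_op = opt.minimize(loss)
```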

Adagrad

class tensorflow.python.training.adagrad.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')[source]

Optimizer that implements the Adagrad algorithm.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
:[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))

Construct a new Adagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Adagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.

Eager compatibility: when eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

AdagradDA

class tensorflow.python.training.adagrad_da.AdagradDAOptimizer(learning_rate, global_step, initial_gradient_squared_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='AdagradDA')[source]

Adagrad Dual Averaging algorithm for sparse linear models.

This optimizer takes care of regularization of unseen features in a mini batch by updating them when they are seen with a closed form update rule that is equivalent to having updated them on every mini-batch.

AdagradDA is typically used when there is a need for large sparsity in the trained model. This optimizer only guarantees sparsity for linear models. Be careful when using AdagradDA for deep networks as it will require careful initialization of the gradient accumulators for it to train.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
:[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))

Construct a new AdagradDA optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
global_step: A Tensor containing the current training step number.
initial_gradient_squared_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “AdagradDA”.
Raises:
ValueError: If the initial_gradient_squared_accumulator_value is invalid.

Adam

class tensorflow.python.training.adam.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Optimizer that implements the Adam algorithm.

References:
Adam - A Method for Stochastic Optimization:
[Kingma et al., 2015](https://arxiv.org/abs/1412.6980) ([pdf](https://arxiv.org/pdf/1412.6980.pdf))

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \quad \text{(Initialize initial 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize initial 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$\text{lr}_t := \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - \text{lr}_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
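For illustration, here is a minimal NumPy sketch of one dense Adam step following the update rule above. The function and variable names are made up for this sketch and are not part of the TensorFlow API.

```python
import numpy as np

def adam_step(var, g, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense Adam update, mirroring the formulas above (illustrative only)."""
    t += 1
    lr_t = learning_rate * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    var = var - lr_t * m / (np.sqrt(v) + epsilon)  # epsilon here is "epsilon hat"
    return var, m, v, t
```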

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay
rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is
“epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients.
Defaults to “Adam”.

Eager compatibility: when eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

Adamax

class tensorflow.python.keras.optimizer_v2.adamax.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax', **kwargs)[source]

Optimizer that implements the Adamax algorithm.

It is a variant of Adam based on the infinity norm. Default parameters follow those provided in the paper. Adamax is sometimes superior to Adam, especially in models with embeddings.

Initialization:

```python
m = 0  # Initialize initial 1st moment vector
v = 0  # Initialize the exponentially weighted infinity norm
t = 0  # Initialize timestep
```

The update rule for parameter w with gradient g is described at the end of section 7.1 of the paper:

```python
t += 1
m = beta1 * m + (1 - beta1) * g
v = max(beta2 * v, abs(g))
current_lr = learning_rate / (1 - beta1 ** t)
w = w - current_lr * m / (v + epsilon)
```

Similarly to Adam, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t == 0).

In contrast to Adam, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) only updates variable slices and corresponding m_t, v_t terms when that part of the variable was used in the forward pass. This means that the sparse behavior is in contrast to the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used).

Args:
learning_rate: A Tensor, floating point value, or a schedule that is a
tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
beta_1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta_2: A float value or a constant float tensor. The exponential decay
rate for the exponentially weighted infinity norm.

epsilon: A small constant for numerical stability.
name: Optional name for the operations created when applying gradients.
Defaults to “Adamax”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.
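Usage example (an illustrative sketch in the style of the Nadam and SGD usage examples below; the variable and loss here are made up):

```python
import tensorflow as tf

opt = tf.keras.optimizers.Adamax(learning_rate=0.001)
var = tf.Variable(10.0)
loss = lambda: (var ** 2) / 2.0   # illustrative quadratic loss
opt.minimize(loss, [var])          # performs one Adamax update step on var
```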
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

AMSGrad

class returnn.tf.updater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())[source]

https://colab.research.google.com/notebook#fileId=1xXFAuHM2Ae-OmF5M8Cn9ypGCa_HHBgfG&scrollTo=N1-2wPHN1Otn
https://openreview.net/pdf?id=ryQu7f-RZ
https://keras.io/optimizers/
https://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage
https://github.com/taki0112/AMSGrad-Tensorflow

apply_gradients(gradient_variables)[source]
Parameters: gradient_variables (list[(tf.Tensor,tf.Variable)])
Return type: tf.Operation

BaseCustom

class returnn.tf.updater.BaseCustomOptimizer(learning_rate, use_locking=False, name=None)[source]

Base class for our own optimizer implementations. This simplifies the interface to be implemented compared to Optimizer: you just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer as an example.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
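A hedged sketch of what such a subclass can look like. The _apply() signature and the learning-rate/assign helpers used here are assumptions about the RETURNN internals; treat CustomGradientDescentOptimizer in returnn.tf.updater as the authoritative example.

```python
import tensorflow as tf
from returnn.tf.updater import BaseCustomOptimizer


class ScaledGradientDescentOptimizer(BaseCustomOptimizer):
    """Illustrative only: plain gradient descent with an extra constant scale."""

    def __init__(self, scale=1.0, **kwargs):
        super(ScaledGradientDescentOptimizer, self).__init__(**kwargs)
        self.scale = scale

    def _apply(self, grad, var, indices=None):
        # Assumed helpers: self._learning_rate_tensor holds the learning rate as a
        # tensor, and self._assign_sub() handles dense and sparse (indices) updates.
        lr = tf.cast(self._learning_rate_tensor, var.dtype.base_dtype)
        return self._assign_sub(ref=var, updates=lr * self.scale * grad, indices=indices)
```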

CustomAdam

class returnn.tf.updater.CustomAdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Reimplementation of Adam. See also tf.compat.v1.train.AdamOptimizer.

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of g*g (v)
  • epsilon (float) –

CustomGradientDescent

class returnn.tf.updater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)[source]

Just an example implementation for simple gradient descent.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.

Ftrl

class tensorflow.python.training.ftrl.FtrlOptimizer(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl', accum_name=None, linear_name=None, l2_shrinkage_regularization_strength=0.0, beta=None)[source]

Optimizer that implements the FTRL algorithm.

This version has support for both online L2 (McMahan et al., 2013) and shrinkage-type L2, which is the addition of an L2 penalty to the loss function.

References:
Ad-click prediction:
[McMahan et al., 2013](https://dl.acm.org/citation.cfm?id=2488200) ([pdf](https://dl.acm.org/ft_gateway.cfm?id=2488200&ftid=1388399&dwn=1&CFID=32233078&CFTOKEN=d60fe57a294c056a-CB75C374-F915-E7A6-1573FBBC7BF7D526))

Construct a new FTRL optimizer.

Args:
learning_rate: A float value or a constant float Tensor.
learning_rate_power: A float value, must be less than or equal to zero.
Controls how the learning rate decreases during training. Use zero for
a fixed learning rate. See section 3.1 in (McMahan et al., 2013).
initial_accumulator_value: The starting value for accumulators.
Only zero or positive values are allowed.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Ftrl”.
accum_name: The suffix for the variable that keeps the gradient squared
accumulator. If not present, defaults to name.
linear_name: The suffix for the variable that keeps the linear gradient
accumulator. If not present, defaults to name + “_1”.
l2_shrinkage_regularization_strength: A float value, must be greater than
or equal to zero. This differs from L2 above in that the L2 above is a
stabilization penalty, whereas this L2 shrinkage is a magnitude penalty.
The FTRL formulation can be written as
w_{t+1} = argmin_w(hat{g}_{1:t} * w + L1 * ||w||_1 + L2 * ||w||_2^2),
where hat{g} = g + (2 * L2_shrinkage * w), and g is the gradient of the
loss function w.r.t. the weights w. Specifically, in the absence of L1
regularization, it is equivalent to the following update rule:
w_{t+1} = w_t - lr_t / (beta + 2*L2*lr_t) * g_t
          - 2*L2_shrinkage*lr_t / (beta + 2*L2*lr_t) * w_t,
where lr_t is the learning rate at t. When input is sparse, shrinkage will
only happen on the active weights.
beta: A float value; corresponds to the beta parameter in the paper.

Raises:
ValueError: If one of the arguments is invalid.

GradientDescent

class tensorflow.python.training.gradient_descent.GradientDescentOptimizer(learning_rate, use_locking=False, name='GradientDescent')[source]

Optimizer that implements the gradient descent algorithm.

Construct a new gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “GradientDescent”.

Eager compatibility: when eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

GradVarianceScaled

class returnn.tf.updater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Let m be the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g.
Same beta1 default as in Adam and in the paper: beta1=0.9.

Let v be the running average of the variance of g, i.e. of (g - m)^2.

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of variance of g (v)
  • epsilon (float) –

Momentum

class tensorflow.python.training.momentum.MomentumOptimizer(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)[source]

Optimizer that implements the Momentum algorithm.

Computes (if use_nesterov = False):

```
accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation
```

Note that in the dense version of this algorithm, accumulation is updated and applied regardless of a gradient’s value, whereas the sparse version (when the gradient is an IndexedSlices, typically because of tf.gather or an embedding) only updates variable slices and corresponding accumulation terms when that part of the variable was used in the forward pass.

Construct a new Momentum optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Momentum”.
use_nesterov: If True use Nesterov Momentum.
See (Sutskever et al., 2013). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper. This implementation is an approximation of the original formula, valid for high values of momentum. It will compute the “adjusted gradient” in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient.
References:
On the importance of initialization and momentum in deep learning:
[Sutskever et al., 2013](http://proceedings.mlr.press/v28/sutskever13.html) ([pdf](http://proceedings.mlr.press/v28/sutskever13.pdf))

Eager compatibility: when eager execution is enabled, learning_rate and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

Nadam

class tensorflow.python.keras.optimizer_v2.nadam.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Nadam', **kwargs)[source]

Optimizer that implements the NAdam algorithm. Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta_1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta_2: A float value or a constant float tensor. The exponential decay
rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
name: Optional name for the operations created when applying gradients.
Defaults to “Nadam”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.
Usage Example:
>>> opt = tf.keras.optimizers.Nadam(learning_rate=0.2)
>>> var1 = tf.Variable(10.0)
>>> loss = lambda: (var1 ** 2) / 2.0
>>> step_count = opt.minimize(loss, [var1]).numpy()
>>> "{:.1f}".format(var1.numpy())
9.8
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

NeuralOptimizer1

class returnn.tf.updater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)[source]

Via Neural Optimizer Search with Reinforcement Learning (https://proceedings.mlr.press/v70/bello17a/bello17a.pdf).

Equivalent to the optimizer update g * exp(sign(g) * sign(m)), here we use:

g * where(sign(g) == sign(m), 1.0, decrease_factor)

where m is the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g.
Same beta1 default as in Adam and in the paper: beta1=0.9.

Parameters:
  • beta1 (float) – used for the running average of m
  • decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
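A minimal NumPy sketch of the update described above (the names are made up for illustration; this is not the RETURNN implementation):

```python
import numpy as np

def neural_optimizer1_step(var, g, m, learning_rate=0.001, beta1=0.9, decrease_factor=0.1):
    """One step of the sign-agreement scaling described above (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * g                          # running average of g
    scale = np.where(np.sign(g) == np.sign(m), 1.0, decrease_factor)
    var = var - learning_rate * g * scale                      # shrink steps where g and m disagree in sign
    return var, m
```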

NormalizedSGD

class returnn.tf.updater.NormalizedSGD(learning_rate, use_locking=False, name=None)[source]

All grads are L2 normalized (via tf.nn.l2_normalize()), otherwise it’s standard SGD. Via: https://github.com/kmkolasinski/deep-learning-notes/tree/master/max-normed-optimizer

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
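A minimal sketch of the idea (illustrative only, not the actual RETURNN class): each gradient is globally L2-normalized before a standard SGD step.

```python
import tensorflow as tf

def normalized_sgd_updates(grads_and_vars, learning_rate=0.01):
    # grads_and_vars is assumed to be a list of (gradient, variable) pairs,
    # e.g. as returned by compute_gradients(). Each gradient is L2-normalized
    # over all of its elements before the plain SGD step.
    ops = []
    for grad, var in grads_and_vars:
        normed = tf.nn.l2_normalize(grad)  # axis=None: normalize over all elements
        ops.append(tf.compat.v1.assign_sub(var, learning_rate * normed))
    return tf.group(*ops)
```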

ProximalAdagrad

class tensorflow.python.training.proximal_adagrad.ProximalAdagradOptimizer(learning_rate, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalAdagrad')[source]

Optimizer that implements the Proximal Adagrad algorithm.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization:
[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))
Efficient Learning using Forward-Backward Splitting:
[Duchi et al., 2009](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting) ([pdf](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf))

Construct a new ProximalAdagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “ProximalAdagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.

ProximalGradientDescent

class tensorflow.python.training.proximal_gradient_descent.ProximalGradientDescentOptimizer(learning_rate, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalGradientDescent')[source]

Optimizer that implements the proximal gradient descent algorithm.

References:
Efficient Learning using Forward-Backward Splitting:
[Duchi et al., 2009](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting) ([pdf](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf))

Construct a new proximal gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.

use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “ProximalGradientDescent”.

RMSProp

class tensorflow.python.training.rmsprop.RMSPropOptimizer(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, name='RMSProp')[source]

Optimizer that implements the RMSProp algorithm (Tieleman et al., 2012).

References:
Coursera slide 29: Hinton, 2012 ([pdf](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf))

Construct a new RMSProp optimizer.

Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an “eventual” correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
decay: Discounting factor for the history/coming gradient.
momentum: A scalar tensor.
epsilon: Small value to avoid zero denominator.
use_locking: If True use locks for update operation.
centered: If True, gradients are normalized by the estimated variance of
the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “RMSProp”.

Eager compatibility: when eager execution is enabled, learning_rate, decay, momentum, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

SGD

class tensorflow.python.keras.optimizer_v2.gradient_descent.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)[source]

Gradient descent (with momentum) optimizer.

Update rule for parameter w with gradient g when momentum is 0:

```python
w = w - learning_rate * g
```

Update rule when momentum is larger than 0:

```python
velocity = momentum * velocity - learning_rate * g
w = w + velocity
```

When nesterov=True, this rule becomes:

```python
velocity = momentum * velocity - learning_rate * g
w = w + momentum * velocity - learning_rate * g
```

Args:
learning_rate: A Tensor, floating point value, or a schedule that is a
tf.keras.optimizers.schedules.LearningRateSchedule, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.01.
momentum: float hyperparameter >= 0 that accelerates gradient descent
in the relevant direction and dampens oscillations. Defaults to 0, i.e., vanilla gradient descent.
nesterov: boolean. Whether to apply Nesterov momentum.
Defaults to False.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “SGD”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.

Usage:

>>> opt = tf.keras.optimizers.SGD(learning_rate=0.1)
>>> var = tf.Variable(1.0)
>>> loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> # Step is `- learning_rate * grad`
>>> var.numpy()
0.9
>>> opt = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
>>> var = tf.Variable(1.0)
>>> val0 = var.value()
>>> loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
>>> # First step is `- learning_rate * grad`
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> val1 = var.value()
>>> (val0 - val1).numpy()
0.1
>>> # On later steps, step-size increases because of momentum
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> val2 = var.value()
>>> (val1 - val2).numpy()
0.18
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

SyncReplicas

class tensorflow.python.training.sync_replicas_optimizer.SyncReplicasOptimizer(opt, replicas_to_aggregate, total_num_replicas=None, variable_averages=None, variables_to_average=None, use_locking=False, name='sync_replicas')[source]

Class to synchronize, aggregate gradients and pass them to the optimizer.

This class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

In a typical asynchronous training environment, it’s common to have some stale gradients. For example, with N-replica asynchronous training, gradients will be applied to the variables N times independently. Depending on each replica’s training speed, some gradients might be calculated from copies of the variable from several steps back (N-1 steps on average). This optimizer avoids stale gradients by collecting gradients from all replicas, averaging them, then applying them to the variables in one shot, after which replicas can fetch the new variables and continue.

The following accumulators/queue are created:

  • N gradient accumulators, one per variable to train. Gradients are pushed to them and the chief worker will wait until enough gradients are collected and then average them before applying to variables. The accumulator will drop all stale gradients (more details in the accumulator op).
  • 1 token queue where the optimizer pushes the new global_step value after all variables are updated.

The following local variable is created:

  • sync_rep_local_step, one per replica. Compared against the global_step in each accumulator to check for staleness of the gradients.

The optimizer adds nodes to the graph to collect gradients and pause the trainers until variables are updated. For the Parameter Server job:

  1. An accumulator is created for each variable, and each replica pushes the gradients into the accumulators instead of directly applying them to the variables.
  2. Each accumulator averages once enough gradients (replicas_to_aggregate) have been accumulated.
  3. Apply the averaged gradients to the variables.
  4. Only after all variables have been updated, increment the global step.
  5. Only after step 4, pushes global_step in the token_queue, once for each worker replica. The workers can now fetch the global step, use it to update its local_step variable and start the next batch. Please note that some workers can consume multiple minibatches, while some may not consume even one. This is because each worker fetches minibatches as long as a token exists. If one worker is stuck for some reason and does not consume a token, another worker can use it.

For the replicas:

  1. Start a step: fetch variables and compute gradients.
  2. Once the gradients have been computed, push them into gradient accumulators. Each accumulator will check the staleness and drop the stale.
  3. After pushing all the gradients, dequeue an updated value of global_step from the token queue and record that step to its local_step variable. Note that this is effectively a barrier.
  4. Start the next batch.

### Usage

```python
# Create any optimizer to update the variables, say a simple SGD:
opt = GradientDescentOptimizer(learning_rate=0.1)

# Wrap the optimizer with sync_replicas_optimizer with 50 replicas: at each
# step the optimizer collects 50 gradients before applying to variables.
# Note that if you want to have 2 backup replicas, you can change
# total_num_replicas=52 and make sure this number matches how many physical
# replicas you started in your job.
opt = tf.compat.v1.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=50,
                                               total_num_replicas=50)

# Some models have startup_delays to help stabilize the model but when using
# sync_replicas training, set it to 0.

# Now you can call minimize() or compute_gradients() and
# apply_gradients() normally
training_op = opt.minimize(total_loss, global_step=self.global_step)

# You can create the hook which handles initialization and queues.
sync_replicas_hook = opt.make_session_run_hook(is_chief)
```

In the training program, every worker will run the train_op as if not synchronized.

```python
with training.MonitoredTrainingSession(
    master=workers[worker_id].target, is_chief=is_chief,
    hooks=[sync_replicas_hook]) as mon_sess:
  while not mon_sess.should_stop():
    mon_sess.run(training_op)
```

To use SyncReplicasOptimizer with an Estimator, you need to send sync_replicas_hook while calling the fit.

```python
my_estimator = DNNClassifier(..., optimizer=opt)
my_estimator.fit(..., hooks=[sync_replicas_hook])
```

Construct a sync_replicas optimizer. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: The SyncReplicaOptimizer class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

Args:
opt: The actual optimizer that will be used to compute and apply the
gradients. Must be one of the Optimizer classes.
replicas_to_aggregate: number of replicas to aggregate for each variable
update.
total_num_replicas: Total number of tasks/workers/replicas, could be
different from replicas_to_aggregate. If total_num_replicas > replicas_to_aggregate: it is backup_replicas + replicas_to_aggregate. If total_num_replicas < replicas_to_aggregate: Replicas compute multiple batches per update to variables.
variable_averages: Optional ExponentialMovingAverage object, used to
maintain moving averages for the variables passed in variables_to_average.
variables_to_average: a list of variables that need to be averaged. Only
needed if variable_averages is passed in.

use_locking: If True use locks for update operation.
name: string. Optional name of the returned operation.

compute_gradients(*args, **kwargs)[source]

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in the apply_gradients() so that user can modify the gradients like clipping with per replica global norm if needed. The global norm with aggregated gradients can be bad as one replica’s huge gradients can hurt the gradients from other replicas.

Args:
*args: Arguments for compute_gradients().
**kwargs: Keyword arguments for compute_gradients().
Returns:
A list of (gradient, variable) pairs.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This contains most of the synchronization implementation and also wraps the apply_gradients() from the real optimizer.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
train_op: The op to dequeue a token so the replicas can exit this batch and start the next one. This is executed by each replica.
Raises:

ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be
checked.
get_chief_queue_runner()[source]

Returns the QueueRunner for the chief to execute.

This includes the operations to synchronize replicas: aggregate gradients, apply to variables, increment global step, insert tokens to token queue.

Note that this can only be called after calling apply_gradients() which actually generates this queuerunner.

Returns:
A QueueRunner for chief to execute.
Raises:
ValueError: If this is called before apply_gradients().
get_slot(*args, **kwargs)[source]

Return a slot named “name” created for “var” by the Optimizer.

This simply wraps the get_slot() from the actual optimizer.

Args:
*args: Arguments for get_slot().
**kwargs: Keyword arguments for get_slot().
Returns:
The Variable for the slot if it was created, None otherwise.
variables()[source]

Fetches a list of optimizer variables in the default graph.

This wraps variables() from the actual optimizer. It does not include the SyncReplicasOptimizer’s local step.

Returns:
A list of variables.
get_slot_names(*args, **kwargs)[source]

Return a list of the names of slots created by the Optimizer.

This simply wraps the get_slot_names() from the actual optimizer.

Args:
*args: Arguments for get_slot_names().
**kwargs: Keyword arguments for get_slot_names().
Returns:
A list of strings.
get_init_tokens_op(num_tokens=-1)[source]

Returns the op to fill the sync_token_queue with the tokens.

This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure: num_tokens >= replicas_to_aggregate - total_num_replicas.

Args:
num_tokens: Number of tokens to add to the queue.
Returns:
An op for the chief/sync replica to fill the token queue.
Raises:

ValueError: If this is called before apply_gradients().
ValueError: If num_tokens are smaller than replicas_to_aggregate -
total_num_replicas.
make_session_run_hook(is_chief, num_tokens=-1)[source]

Creates a hook to handle SyncReplicasHook ops such as initialization.