Optimizer

This is a list of all optimizers that can be used with RETURNN. If you want to know how to set the optimizer correctly in the RETURNN config, please have a look at the optimizer settings.
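For orientation, here is a minimal sketch of how an optimizer is typically selected in a RETURNN config. The exact option names are an assumption and may differ between RETURNN versions; the optimizer settings documentation is authoritative.

```python
# Sketch of a RETURNN config fragment (assumed option names; see the
# optimizer settings documentation for the authoritative spelling).
learning_rate = 0.001

# The optimizer is selected by class name; any further keys are passed
# to the optimizer's constructor as keyword arguments.
optimizer = {"class": "adam", "epsilon": 1e-8}
```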

Adadelta

class tensorflow.python.training.adadelta.AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')[source]

Optimizer that implements the Adadelta algorithm.

References:
ADADELTA - An Adaptive Learning Rate Method:
[Zeiler, 2012](http://arxiv.org/abs/1212.5701) ([pdf](http://arxiv.org/pdf/1212.5701v1.pdf))

Construct a new Adadelta optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
To match the exact form in the original paper use 1.0.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used
to better condition the grad update.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Adadelta”.

Eager compatibility: when eager execution is enabled, learning_rate, rho, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.
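For reference, a construction sketch outside of RETURNN, using the learning rate suggested above to match the paper; the toy variable and loss are made up for illustration.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF1-style graph mode for this sketch

# Illustrative: a toy variable and loss, just to have something to minimize.
var = tf.compat.v1.get_variable("w", shape=(), initializer=tf.zeros_initializer())
loss = tf.square(var - 3.0)

# learning_rate=1.0 matches the exact form in the original paper (see above).
opt = tf.compat.v1.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-8)
train_op = opt.minimize(loss)
```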

Adagrad

class tensorflow.python.training.adagrad.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')[source]

Optimizer that implements the Adagrad algorithm.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
:[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))

Construct a new Adagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Adagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.

Eager compatibility: when eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

AdagradDA

class tensorflow.python.training.adagrad_da.AdagradDAOptimizer(learning_rate, global_step, initial_gradient_squared_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='AdagradDA')[source]

Adagrad Dual Averaging algorithm for sparse linear models.

This optimizer takes care of regularization of unseen features in a mini batch by updating them when they are seen with a closed form update rule that is equivalent to having updated them on every mini-batch.

AdagradDA is typically used when there is a need for large sparsity in the trained model. This optimizer only guarantees sparsity for linear models. Be careful when using AdagradDA for deep networks as it will require careful initialization of the gradient accumulators for it to train.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
:[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))

Construct a new AdagradDA optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
global_step: A Tensor containing the current training step number.
initial_gradient_squared_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “AdagradDA”.
Raises:
ValueError: If the initial_gradient_squared_accumulator_value is invalid.

Adam

class tensorflow.python.training.adam.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Optimizer that implements the Adam algorithm.

References:
Adam - A Method for Stochastic Optimization:
[Kingma et al., 2015](https://arxiv.org/abs/1412.6980) ([pdf](https://arxiv.org/pdf/1412.6980.pdf))

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \quad \text{(Initialize initial 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize initial 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$\text{lr}_t := \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - \text{lr}_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
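For illustration, here is a minimal NumPy sketch of one dense Adam step following the update rule above. The function and variable names are made up for this sketch and are not part of the TensorFlow API.

```python
import numpy as np

def adam_step(var, g, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense Adam update, mirroring the formulas above (illustrative only)."""
    t += 1
    lr_t = learning_rate * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    var = var - lr_t * m / (np.sqrt(v) + epsilon)  # epsilon here is "epsilon hat"
    return var, m, v, t
```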

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay
rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is
“epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients.
Defaults to “Adam”.

Eager compatibility: when eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

Adamax

class tensorflow.python.keras.optimizer_v2.adamax.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax', **kwargs)[source]

Optimizer that implements the Adamax algorithm.

It is a variant of Adam based on the infinity norm. Default parameters follow those provided in the paper. Adamax is sometimes superior to Adam, especially in models with embeddings.

Initialization:

```python
m = 0  # Initialize initial 1st moment vector
v = 0  # Initialize the exponentially weighted infinity norm
t = 0  # Initialize timestep
```

The update rule for parameter w with gradient g is described at the end of section 7.1 of the paper:

```python
t += 1
m = beta1 * m + (1 - beta1) * g
v = max(beta2 * v, abs(g))
current_lr = learning_rate / (1 - beta1 ** t)
w = w - current_lr * m / (v + epsilon)
```

Similarly to Adam, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t == 0).

In contrast to Adam, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) only updates variable slices and corresponding m_t, v_t terms when that part of the variable was used in the forward pass. This means that the sparse behavior is in contrast to the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used).

Args:
learning_rate: A Tensor, floating point value, or a schedule that is a
tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
beta_1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta_2: A float value or a constant float tensor. The exponential decay
rate for the exponentially weighted infinity norm.

epsilon: A small constant for numerical stability.
name: Optional name for the operations created when applying gradients.
Defaults to “Adamax”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.
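Usage example (an illustrative sketch in the style of the Nadam and SGD usage examples below; the variable and loss here are made up):

```python
import tensorflow as tf

opt = tf.keras.optimizers.Adamax(learning_rate=0.001)
var = tf.Variable(10.0)
loss = lambda: (var ** 2) / 2.0   # illustrative quadratic loss
opt.minimize(loss, [var])          # performs one Adamax update step on var
```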
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

AMSGrad

class returnn.tf.updater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())[source]

https://colab.research.google.com/notebook#fileId=1xXFAuHM2Ae-OmF5M8Cn9ypGCa_HHBgfG&scrollTo=N1-2wPHN1Otn
https://openreview.net/pdf?id=ryQu7f-RZ
https://keras.io/optimizers/
https://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage
https://github.com/taki0112/AMSGrad-Tensorflow

apply_gradients(gradient_variables)[source]
Parameters: gradient_variables (list[(tf.Tensor,tf.Variable)])
Return type: tf.Operation

BaseCustom

class returnn.tf.updater.BaseCustomOptimizer(learning_rate, use_locking=False, name=None)[source]

Base class for our own optimizer implementations. This simplifies the interface to be implemented compared to Optimizer: you just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer as an example.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
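A hedged sketch of what such a subclass can look like. The _apply() signature and the learning-rate/assign helpers used here are assumptions about the RETURNN internals; treat CustomGradientDescentOptimizer in returnn.tf.updater as the authoritative example.

```python
import tensorflow as tf
from returnn.tf.updater import BaseCustomOptimizer


class ScaledGradientDescentOptimizer(BaseCustomOptimizer):
    """Illustrative only: plain gradient descent with an extra constant scale."""

    def __init__(self, scale=1.0, **kwargs):
        super(ScaledGradientDescentOptimizer, self).__init__(**kwargs)
        self.scale = scale

    def _apply(self, grad, var, indices=None):
        # Assumed helpers: self._learning_rate_tensor holds the learning rate as a
        # tensor, and self._assign_sub() handles dense and sparse (indices) updates.
        lr = tf.cast(self._learning_rate_tensor, var.dtype.base_dtype)
        return self._assign_sub(ref=var, updates=lr * self.scale * grad, indices=indices)
```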

CustomAdam

class returnn.tf.updater.CustomAdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Reimplementation of Adam. See also tf.compat.v1.train.AdamOptimizer.

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of g*g (v)
  • epsilon (float) –

CustomGradientDescent

class returnn.tf.updater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)[source]

Just an example implementation for simple gradient descent.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.

Ftrl

class tensorflow.python.training.ftrl.FtrlOptimizer(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl', accum_name=None, linear_name=None, l2_shrinkage_regularization_strength=0.0, beta=None)[source]

Optimizer that implements the FTRL algorithm.

This version has support for both online L2 (McMahan et al., 2013) and shrinkage-type L2, which is the addition of an L2 penalty to the loss function.

References:
Ad-click prediction:
[McMahan et al., 2013](https://dl.acm.org/citation.cfm?id=2488200) ([pdf](https://dl.acm.org/ft_gateway.cfm?id=2488200&ftid=1388399&dwn=1&CFID=32233078&CFTOKEN=d60fe57a294c056a-CB75C374-F915-E7A6-1573FBBC7BF7D526))

Construct a new FTRL optimizer.

Args:
learning_rate: A float value or a constant float Tensor.
learning_rate_power: A float value, must be less than or equal to zero.
Controls how the learning rate decreases during training. Use zero for
a fixed learning rate. See section 3.1 in (McMahan et al., 2013).
initial_accumulator_value: The starting value for accumulators.
Only zero or positive values are allowed.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Ftrl”.
accum_name: The suffix for the variable that keeps the gradient squared
accumulator. If not present, defaults to name.
linear_name: The suffix for the variable that keeps the linear gradient
accumulator. If not present, defaults to name + “_1”.
l2_shrinkage_regularization_strength: A float value, must be greater than
or equal to zero. This differs from L2 above in that the L2 above is a
stabilization penalty, whereas this L2 shrinkage is a magnitude penalty.
The FTRL formulation can be written as
w_{t+1} = argmin_w(hat{g}_{1:t} * w + L1 * ||w||_1 + L2 * ||w||_2^2),
where hat{g} = g + (2 * L2_shrinkage * w), and g is the gradient of the
loss function w.r.t. the weights w. Specifically, in the absence of L1
regularization, it is equivalent to the following update rule:
w_{t+1} = w_t - lr_t / (beta + 2*L2*lr_t) * g_t
          - 2*L2_shrinkage*lr_t / (beta + 2*L2*lr_t) * w_t,
where lr_t is the learning rate at t. When input is sparse, shrinkage will
only happen on the active weights.
beta: A float value; corresponds to the beta parameter in the paper.

Raises:
ValueError: If one of the arguments is invalid.

GradientDescent

class tensorflow.python.training.gradient_descent.GradientDescentOptimizer(learning_rate, use_locking=False, name='GradientDescent')[source]

Optimizer that implements the gradient descent algorithm.

Construct a new gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “GradientDescent”.

Eager compatibility: when eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

GradVarianceScaled

class returnn.tf.updater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Let m be the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g.
Same beta1 default as in Adam and in the paper: beta1=0.9.

Let v be the running average of the variance of g, i.e. of (g - m)^2.

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of variance of g (v)
  • epsilon (float) –

Momentum

class tensorflow.python.training.momentum.MomentumOptimizer(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)[source]

Optimizer that implements the Momentum algorithm.

Computes (if use_nesterov = False):

```
accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation
```

Note that in the dense version of this algorithm, accumulation is updated and applied regardless of a gradient’s value, whereas the sparse version (when the gradient is an IndexedSlices, typically because of tf.gather or an embedding) only updates variable slices and corresponding accumulation terms when that part of the variable was used in the forward pass.

Construct a new Momentum optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “Momentum”.
use_nesterov: If True use Nesterov Momentum.
See (Sutskever et al., 2013). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper. This implementation is an approximation of the original formula, valid for high values of momentum. It will compute the “adjusted gradient” in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient.
References:
On the importance of initialization and momentum in deep learning:
[Sutskever et al., 2013](http://proceedings.mlr.press/v28/sutskever13.html) ([pdf](http://proceedings.mlr.press/v28/sutskever13.pdf))

Eager compatibility: when eager execution is enabled, learning_rate and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

Nadam

class tensorflow.python.keras.optimizer_v2.nadam.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Nadam', **kwargs)[source]

Optimizer that implements the NAdam algorithm. Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta_1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates.
beta_2: A float value or a constant float tensor. The exponential decay
rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
name: Optional name for the operations created when applying gradients.
Defaults to “Nadam”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.
Usage Example:
>>> opt = tf.keras.optimizers.Nadam(learning_rate=0.2)
>>> var1 = tf.Variable(10.0)
>>> loss = lambda: (var1 ** 2) / 2.0
>>> step_count = opt.minimize(loss, [var1]).numpy()
>>> "{:.1f}".format(var1.numpy())
9.8
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

NeuralOptimizer1

class returnn.tf.updater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)[source]

Via Neural Optimizer Search with Reinforcement Learning (https://proceedings.mlr.press/v70/bello17a/bello17a.pdf).

Equivalent to the optimizer update g * exp(sign(g) * sign(m)), here we use:

g * where(sign(g) == sign(m), 1.0, decrease_factor)

where m is the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g.
Same beta1 default as in Adam and in the paper: beta1=0.9.

Parameters:
  • beta1 (float) – used for the running average of m
  • decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
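A minimal NumPy sketch of the update described above (the names are made up for illustration; this is not the RETURNN implementation):

```python
import numpy as np

def neural_optimizer1_step(var, g, m, learning_rate=0.001, beta1=0.9, decrease_factor=0.1):
    """One step of the sign-agreement scaling described above (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * g                          # running average of g
    scale = np.where(np.sign(g) == np.sign(m), 1.0, decrease_factor)
    var = var - learning_rate * g * scale                      # shrink steps where g and m disagree in sign
    return var, m
```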

NormalizedSGD

class returnn.tf.updater.NormalizedSGD(learning_rate, use_locking=False, name=None)[source]

All grads are L2 normalized (via tf.nn.l2_normalize()), otherwise it’s standard SGD. Via: https://github.com/kmkolasinski/deep-learning-notes/tree/master/max-normed-optimizer

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
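A minimal sketch of the idea (illustrative only, not the actual RETURNN class): each gradient is globally L2-normalized before a standard SGD step.

```python
import tensorflow as tf

def normalized_sgd_updates(grads_and_vars, learning_rate=0.01):
    # grads_and_vars is assumed to be a list of (gradient, variable) pairs,
    # e.g. as returned by compute_gradients(). Each gradient is L2-normalized
    # over all of its elements before the plain SGD step.
    ops = []
    for grad, var in grads_and_vars:
        normed = tf.nn.l2_normalize(grad)  # axis=None: normalize over all elements
        ops.append(tf.compat.v1.assign_sub(var, learning_rate * normed))
    return tf.group(*ops)
```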

ProximalAdagrad

class tensorflow.python.training.proximal_adagrad.ProximalAdagradOptimizer(learning_rate, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalAdagrad')[source]

Optimizer that implements the Proximal Adagrad algorithm.

References:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization:
[Duchi et al., 2011](http://jmlr.org/papers/v12/duchi11a.html) ([pdf](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))
Efficient Learning using Forward-Backward Splitting:
[Duchi et al., 2009](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting) ([pdf](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf))

Construct a new ProximalAdagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value.
Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “ProximalAdagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.

ProximalGradientDescent

class tensorflow.python.training.proximal_gradient_descent.ProximalGradientDescentOptimizer(learning_rate, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalGradientDescent')[source]

Optimizer that implements the proximal gradient descent algorithm.

References:
Efficient Learning using Forward-Backward Splitting:
[Duchi et al., 2009](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting) ([pdf](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf))

Construct a new proximal gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.

use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “ProximalGradientDescent”.

RMSProp

class tensorflow.python.training.rmsprop.RMSPropOptimizer(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, name='RMSProp')[source]

Optimizer that implements the RMSProp algorithm (Tieleman et al., 2012).

References:
Coursera slide 29: Hinton, 2012 ([pdf](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf))

Construct a new RMSProp optimizer.

Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an “eventual” correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
decay: Discounting factor for the history/coming gradient.
momentum: A scalar tensor.
epsilon: Small value to avoid zero denominator.
use_locking: If True use locks for update operation.
centered: If True, gradients are normalized by the estimated variance of
the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “RMSProp”.

Eager compatibility: when eager execution is enabled, learning_rate, decay, momentum, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

SGD

class tensorflow.python.keras.optimizer_v2.gradient_descent.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)[source]

Gradient descent (with momentum) optimizer.

Update rule for parameter w with gradient g when momentum is 0:

```python
w = w - learning_rate * g
```

Update rule when momentum is larger than 0:

```python
velocity = momentum * velocity - learning_rate * g
w = w + velocity
```

When nesterov=True, this rule becomes:

```python
velocity = momentum * velocity - learning_rate * g
w = w + momentum * velocity - learning_rate * g
```

Args:
learning_rate: A Tensor, floating point value, or a schedule that is a
tf.keras.optimizers.schedules.LearningRateSchedule, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.01.
momentum: float hyperparameter >= 0 that accelerates gradient descent
in the relevant direction and dampens oscillations. Defaults to 0, i.e., vanilla gradient descent.
nesterov: boolean. Whether to apply Nesterov momentum.
Defaults to False.
name: Optional name prefix for the operations created when applying
gradients. Defaults to “SGD”.
**kwargs: Keyword arguments. Allowed to be one of
“clipnorm” or “clipvalue”. “clipnorm” (float) clips gradients by norm; “clipvalue” (float) clips gradients by value.

Usage:

>>> opt = tf.keras.optimizers.SGD(learning_rate=0.1)
>>> var = tf.Variable(1.0)
>>> loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> # Step is `- learning_rate * grad`
>>> var.numpy()
0.9
>>> opt = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
>>> var = tf.Variable(1.0)
>>> val0 = var.value()
>>> loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
>>> # First step is `- learning_rate * grad`
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> val1 = var.value()
>>> (val0 - val1).numpy()
0.1
>>> # On later steps, step-size increases because of momentum
>>> step_count = opt.minimize(loss, [var]).numpy()
>>> val2 = var.value()
>>> (val1 - val2).numpy()
0.18
get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns:
Python dictionary.

SyncReplicas

class tensorflow.python.training.sync_replicas_optimizer.SyncReplicasOptimizer(opt, replicas_to_aggregate, total_num_replicas=None, variable_averages=None, variables_to_average=None, use_locking=False, name='sync_replicas')[source]

Class to synchronize, aggregate gradients and pass them to the optimizer.

This class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

In a typical asynchronous training environment, it’s common to have some stale gradients. For example, with N-replica asynchronous training, gradients will be applied to the variables N times independently. Depending on each replica’s training speed, some gradients might be calculated from copies of the variable from several steps back (N-1 steps on average). This optimizer avoids stale gradients by collecting gradients from all replicas, averaging them, then applying them to the variables in one shot, after which replicas can fetch the new variables and continue.

The following accumulators/queue are created:

  • N gradient accumulators, one per variable to train. Gradients are pushed to them and the chief worker will wait until enough gradients are collected and then average them before applying to variables. The accumulator will drop all stale gradients (more details in the accumulator op).
  • 1 token queue where the optimizer pushes the new global_step value after all variables are updated.

The following local variable is created:

  • sync_rep_local_step, one per replica. Compared against the global_step in each accumulator to check for staleness of the gradients.

The optimizer adds nodes to the graph to collect gradients and pause the trainers until variables are updated. For the Parameter Server job:

  1. An accumulator is created for each variable, and each replica pushes the gradients into the accumulators instead of directly applying them to the variables.
  2. Each accumulator averages once enough gradients (replicas_to_aggregate) have been accumulated.
  3. Apply the averaged gradients to the variables.
  4. Only after all variables have been updated, increment the global step.
  5. Only after step 4, pushes global_step in the token_queue, once for each worker replica. The workers can now fetch the global step, use it to update its local_step variable and start the next batch. Please note that some workers can consume multiple minibatches, while some may not consume even one. This is because each worker fetches minibatches as long as a token exists. If one worker is stuck for some reason and does not consume a token, another worker can use it.

For the replicas:

  1. Start a step: fetch variables and compute gradients.
  2. Once the gradients have been computed, push them into gradient accumulators. Each accumulator will check the staleness and drop the stale.
  3. After pushing all the gradients, dequeue an updated value of global_step from the token queue and record that step to its local_step variable. Note that this is effectively a barrier.
  4. Start the next batch.

### Usage

```python
# Create any optimizer to update the variables, say a simple SGD:
opt = GradientDescentOptimizer(learning_rate=0.1)

# Wrap the optimizer with sync_replicas_optimizer with 50 replicas: at each
# step the optimizer collects 50 gradients before applying to variables.
# Note that if you want to have 2 backup replicas, you can change
# total_num_replicas=52 and make sure this number matches how many physical
# replicas you started in your job.
opt = tf.compat.v1.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=50,
                                               total_num_replicas=50)

# Some models have startup_delays to help stabilize the model but when using
# sync_replicas training, set it to 0.

# Now you can call minimize() or compute_gradients() and
# apply_gradients() normally
training_op = opt.minimize(total_loss, global_step=self.global_step)

# You can create the hook which handles initialization and queues.
sync_replicas_hook = opt.make_session_run_hook(is_chief)
```

In the training program, every worker will run the train_op as if not synchronized.

```python
with training.MonitoredTrainingSession(
    master=workers[worker_id].target, is_chief=is_chief,
    hooks=[sync_replicas_hook]) as mon_sess:
  while not mon_sess.should_stop():
    mon_sess.run(training_op)
```

To use SyncReplicasOptimizer with an Estimator, you need to send sync_replicas_hook while calling the fit.

```python
my_estimator = DNNClassifier(..., optimizer=opt)
my_estimator.fit(..., hooks=[sync_replicas_hook])
```

Construct a sync_replicas optimizer. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: The SyncReplicaOptimizer class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

Args:
opt: The actual optimizer that will be used to compute and apply the
gradients. Must be one of the Optimizer classes.
replicas_to_aggregate: number of replicas to aggregate for each variable
update.
total_num_replicas: Total number of tasks/workers/replicas, could be
different from replicas_to_aggregate. If total_num_replicas > replicas_to_aggregate: it is backup_replicas + replicas_to_aggregate. If total_num_replicas < replicas_to_aggregate: Replicas compute multiple batches per update to variables.
variable_averages: Optional ExponentialMovingAverage object, used to
maintain moving averages for the variables passed in variables_to_average.
variables_to_average: a list of variables that need to be averaged. Only
needed if variable_averages is passed in.

use_locking: If True use locks for update operation.
name: string. Optional name of the returned operation.

compute_gradients(*args, **kwargs)[source]

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in the apply_gradients() so that user can modify the gradients like clipping with per replica global norm if needed. The global norm with aggregated gradients can be bad as one replica’s huge gradients can hurt the gradients from other replicas.

Args:
*args: Arguments for compute_gradients().
**kwargs: Keyword arguments for compute_gradients().
Returns:
A list of (gradient, variable) pairs.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This contains most of the synchronization implementation and also wraps the apply_gradients() from the real optimizer.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
train_op: The op to dequeue a token so the replicas can exit this batch and start the next one. This is executed by each replica.
Raises:

ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be
checked.
get_chief_queue_runner()[source]

Returns the QueueRunner for the chief to execute.

This includes the operations to synchronize replicas: aggregate gradients, apply to variables, increment global step, insert tokens to token queue.

Note that this can only be called after calling apply_gradients() which actually generates this queuerunner.

Returns:
A QueueRunner for chief to execute.
Raises:
ValueError: If this is called before apply_gradients().
get_slot(*args, **kwargs)[source]

Return a slot named “name” created for “var” by the Optimizer.

This simply wraps the get_slot() from the actual optimizer.

Args:
*args: Arguments for get_slot().
**kwargs: Keyword arguments for get_slot().
Returns:
The Variable for the slot if it was created, None otherwise.
variables()[source]

Fetches a list of optimizer variables in the default graph.

This wraps variables() from the actual optimizer. It does not include the SyncReplicasOptimizer’s local step.

Returns:
A list of variables.
get_slot_names(*args, **kwargs)[source]

Return a list of the names of slots created by the Optimizer.

This simply wraps the get_slot_names() from the actual optimizer.

Args:
*args: Arguments for get_slot_names().
**kwargs: Keyword arguments for get_slot_names().
Returns:
A list of strings.
get_init_tokens_op(num_tokens=-1)[source]

Returns the op to fill the sync_token_queue with the tokens.

This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure: num_tokens >= replicas_to_aggregate - total_num_replicas.

Args:
num_tokens: Number of tokens to add to the queue.
Returns:
An op for the chief/sync replica to fill the token queue.
Raises:

ValueError: If this is called before apply_gradients().
ValueError: If num_tokens are smaller than replicas_to_aggregate -
total_num_replicas.
make_session_run_hook(is_chief, num_tokens=-1)[source]

Creates a hook to handle SyncReplicasHook ops such as initialization.