Optimizer

This is a list of all optimizers that can be used with RETURNN. If you are looking for how to set the optimizer correctly in the RETURNN config, please have a look at the optimizer settings.
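As a quick orientation, a RETURNN config selects one of the optimizers below by name. A minimal sketch (assuming the dict-style `optimizer` setting described in the optimizer settings documentation; the values are only illustrative):

```python
# Minimal sketch of an optimizer choice in a RETURNN config (a Python file).
# The exact keys and accepted class names are defined by the optimizer settings
# documentation; the values here are only illustrative.
optimizer = {"class": "adam", "epsilon": 1e-8}  # pick one of the optimizer classes below
learning_rate = 0.001                           # the learning rate is a separate config setting
```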

Adadelta

class tensorflow.python.training.adadelta.AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')[source]

Optimizer that implements the Adadelta algorithm.

See [M. D. Zeiler](http://arxiv.org/abs/1212.5701) ([pdf](http://arxiv.org/pdf/1212.5701v1.pdf))

Construct a new Adadelta optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate. To match the exact form in the original paper use 1.0.
rho: A Tensor or a floating point value. The decay rate.
epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the grad update.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “Adadelta”.

@compatibility(eager) When eager execution is enabled, learning_rate, rho, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

Adagrad

class tensorflow.python.training.adagrad.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')[source]

Optimizer that implements the Adagrad algorithm.

See this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) or this [intro](https://ppasupat.github.io/a9online/uploads/proximal_notes.pdf).

Construct a new Adagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “Adagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.

@compatibility(eager) When eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

AdagradDA

class tensorflow.python.training.adagrad_da.AdagradDAOptimizer(learning_rate, global_step, initial_gradient_squared_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='AdagradDA')[source]

Adagrad Dual Averaging algorithm for sparse linear models.

See this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).

This optimizer takes care of regularization of unseen features in a mini batch by updating them when they are seen with a closed form update rule that is equivalent to having updated them on every mini-batch.

AdagradDA is typically used when there is a need for large sparsity in the trained model. This optimizer only guarantees sparsity for linear models. Be careful when using AdagradDA for deep networks as it will require careful initialization of the gradient accumulators for it to train.

Construct a new AdagradDA optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
global_step: A Tensor containing the current training step number.
initial_gradient_squared_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or equal to zero.
l2_regularization_strength: A float value, must be greater than or equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “AdagradDA”.
Raises:
ValueError: If the initial_gradient_squared_accumulator_value is invalid.

Adam

class tensorflow.python.training.adam.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Optimizer that implements the Adam algorithm.

See [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \text{ (Initialize initial 1st moment vector)}$$
$$v_0 := 0 \text{ (Initialize initial 2nd moment vector)}$$
$$t := 0 \text{ (Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
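As a plain illustration of the dense update rule above, here is a NumPy sketch (not the TensorFlow implementation; names are chosen for readability):

```python
import numpy as np

def adam_step(var, g, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense Adam step, mirroring the formulas above (illustration only)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g        # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment estimate
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    var = var - lr_t * m / (np.sqrt(v) + epsilon)  # epsilon here is "epsilon hat"
    return var, m, v, t
```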

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

AdaMax

class tensorflow.contrib.opt.python.training.adamax.AdaMaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='AdaMax')[source]

Optimizer that implements the AdaMax algorithm.

AdaMax is sometimes superior to Adam, especially in models with embeddings; see [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).

Construct a new AdaMax optimizer.

Initialization:

```
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize the exponentially weighted infinity norm)
t <- 0 (Initialize timestep)
```

The update rule for variable with gradient g uses an optimization described at the end of section 7.1 of the paper:

```
t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- max(beta2 * v_{t-1}, abs(g))
variable <- variable - learning_rate / (1 - beta1^t) * m_t / (v_t + epsilon)
```

Similar to AdamOptimizer, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t = 0).

In contrast to AdamOptimizer, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) only updates variable slices and corresponding m_t, v_t terms when that part of the variable was used in the forward pass. This means that the sparse behavior differs from the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used).
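The update rule above, written out as a NumPy sketch (illustration only, not the TensorFlow implementation):

```python
import numpy as np

def adamax_step(var, g, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense AdaMax step, mirroring the update rule above (illustration only)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = np.maximum(beta2 * v, np.abs(g))     # exponentially weighted infinity norm
    var = var - learning_rate / (1 - beta1 ** t) * m / (v + epsilon)
    return var, m, v, t
```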

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm.
epsilon: A small constant for numerical stability.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “AdaMax”.

AdamGS

class tensorflow.contrib.opt.python.training.adam_gs_optimizer.AdamGSOptimizer(global_step=0, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Optimizer that implements the Adam algorithm.

See [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).

Construct a new Adam optimizer.

Branched from tf.train.AdamOptimizer. The only difference is to pass global step for computing beta1 and beta2 accumulators, instead of having optimizer keep its own independent beta1 and beta2 accumulators as non-slot variables.

Initialization:

$$m_0 := 0 \text{ (Initialize initial 1st moment vector)}$$
$$v_0 := 0 \text{ (Initialize initial 2nd moment vector)}$$
$$t := 0 \text{ (Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).

Args:
global_step: tensorflow variable indicating the step.
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

AdamW

class tensorflow.contrib.opt.python.training.weight_decay_optimizers.AdamWOptimizer(weight_decay, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='AdamW')[source]

Optimizer that implements the Adam algorithm with weight decay.

This is an implementation of the AdamW optimizer described in [“Fixing Weight Decay Regularization in Adam” by Loshchilov & Hutter](https://arxiv.org/abs/1711.05101) ([pdf](https://arxiv.org/pdf/1711.05101.pdf)).

It computes the update step of train.AdamOptimizer and additionally decays the variable. Note that this is different from adding L2 regularization on the variables to the loss: it regularizes variables with large gradients more than L2 regularization would, which was shown to yield better training loss and generalization error in the paper above.
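To make the distinction concrete, here is a NumPy sketch of the decoupled decay (illustration only; the exact scaling of the decay term differs between implementations, so treat the last update line as an assumption):

```python
import numpy as np

def adamw_step(var, g, m, v, t, learning_rate=0.001, weight_decay=0.01,
               beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam step plus decoupled weight decay (illustration only)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    var = var - lr_t * m / (np.sqrt(v) + epsilon)  # plain Adam update
    # Decoupled decay: applied to the variable directly, not added to the gradient,
    # so it is not rescaled by the adaptive 1/sqrt(v) factor (unlike L2 regularization).
    # The scaling convention for this term varies between implementations.
    var = var - weight_decay * var
    return var, m, v, t
```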

For further information see the documentation of the Adam Optimizer.

Note that this optimizer can also be instantiated as `extend_with_weight_decay(tf.compat.v1.train.AdamOptimizer, weight_decay=weight_decay)`.

Construct a new AdamW optimizer.

For further information see the documentation of the Adam Optimizer.

Args:
weight_decay: A Tensor or a floating point value. The weight decay.
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

AddSign

class tensorflow.contrib.opt.python.training.addsign.AddSignOptimizer(learning_rate=0.1, alpha=1.0, beta=0.9, sign_decay_fn=None, use_locking=False, name='AddSignOptimizer')[source]

Optimizer that implements the AddSign update.

See [Bello et al., ICML2017], [Neural Optimizer Search with RL](https://arxiv.org/abs/1709.07417).

Constructs a new AddSignOptimizer object.

Initialization:

```
m_0 <- 0 (Initialize initial 1st moment vector)
t <- 0 (Initialize timestep)
```

Update:

```
t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
sign_decay <- sign_decay_fn(t)
update <- (alpha + sign_decay * sign(g) * sign(m)) * g
variable <- variable - lr_t * update
```

Example for AddSign-ld (AddSign with linear sign decay):

```
decay_steps = 1000
linear_decay_fn = sign_decays.get_linear_decay_fn(decay_steps)
opt = AddSignOptimizer(learning_rate=0.1, sign_decay_fn=linear_decay_fn)
```
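A NumPy sketch of the update above (illustration only; `sign_decay_fn` is supplied by the user, e.g. the linear decay from the example):

```python
import numpy as np

def addsign_step(var, g, m, t, learning_rate=0.1, alpha=1.0, beta=0.9, sign_decay_fn=None):
    """One AddSign step, mirroring the update above (illustration only)."""
    t += 1
    m = beta * m + (1 - beta) * g                        # moving average of g
    sign_decay = sign_decay_fn(t) if sign_decay_fn is not None else 1.0
    update = (alpha + sign_decay * np.sign(g) * np.sign(m)) * g
    var = var - learning_rate * update
    return var, m, t
```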

Args:
learning_rate: learning_rate used when taking a step.
alpha: alpha used in optimizer.
beta: decay used for computing the moving average m.
sign_decay_fn: decay function applied to the sign(g) sign(m) quantity. Takes global_step as an argument. See sign_decay.py for some examples.
use_locking: If True, use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “AddSignOptimizer”.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.

AGN

class tensorflow.contrib.opt.python.training.agn_optimizer.AGNOptimizer(optimizer, num_worker, custom_getter, communication_period=10, use_locking=True, name='AGNOptimizer')[source]

Wrapper that implements the Accumulated GradientNormalization algorithm.

Reference:
Accumulated Gradient Normalization: Joeri Hermans ACML2017 https://arxiv.org/abs/1710.02368

Construct a new AGN optimizer.

Args:
optimizer: input optimizer, can be sgd/momentum/adam etc.
num_worker: The number of workers.
custom_getter: The AGNCustomGetter.
communication_period: An int value that controls the frequency of the communication between every worker and the ps.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “AGNOptimizer”.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to global variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the variables
have been updated.
name: Optional name for the returned operation. Default to the name
passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
get_init_op(task_index)[source]

Returns the op that sets all the local variables and local center variables equal to the global center variables before the training begins.

make_session_run_hook(is_chief, task_index)[source]

Creates a hook to handle AGNOptimizerHook ops such as initialization.

AMSGrad

class returnn.tf.updater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())[source]

https://colab.research.google.com/notebook#fileId=1xXFAuHM2Ae-OmF5M8Cn9ypGCa_HHBgfG&scrollTo=N1-2wPHN1Otn https://openreview.net/pdf?id=ryQu7f-RZ https://keras.io/optimizers/ http://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage https://github.com/taki0112/AMSGrad-Tensorflow
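The references describe AMSGrad as Adam with a running maximum of the second-moment estimate. A NumPy sketch of that idea (illustration only; the RETURNN class above may differ in details, e.g. its epsilon default is 0.0 and it has a decay option):

```python
import numpy as np

def amsgrad_step(var, g, m, v, v_hat, t, learning_rate=0.001, beta1=0.9, beta2=0.99, epsilon=1e-8):
    """One AMSGrad step as in the referenced paper (illustration only)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)                  # running maximum of the 2nd moment estimate
    var = var - learning_rate * m / (np.sqrt(v_hat) + epsilon)
    return var, m, v, v_hat, t
```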

apply_gradients(gradient_variables)[source]
Parameters:gradient_variables (list[(tf.Tensor,tf.Variable)]) –
Return type:tf.Operation

BaseCustom

class returnn.tf.updater.BaseCustomOptimizer(learning_rate, use_locking=False, name=None)[source]

Base class for our own optimizer implementations. This simplifies the interface to be implemented a bit compared to Optimizer. You just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer for an example.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.

CustomAdam

class returnn.tf.updater.CustomAdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Reimplementation of Adam. See also tf.compat.v1.train.AdamOptimizer.

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of g*g (v)
  • epsilon (float) –

CustomGradientDescent

class returnn.tf.updater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)[source]

Just an example implementation for simple gradient descent.

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.

DropStaleGradient

class tensorflow.contrib.opt.python.training.drop_stale_gradient_optimizer.DropStaleGradientOptimizer(opt, staleness, use_locking=False, name='DropStaleGradient')[source]

Wrapper optimizer that checks and drops stale gradient.

This optimizer records the global step for each worker before computing gradients and compares it with the global step at the time of applying the gradients. If the difference is larger than a threshold, it will drop all the computed gradients.

Constructs a new DropStaleGradientOptimizer.

Args:
opt: The actual optimizer that will be used to compute and apply the gradients. Must be one of the Optimizer classes.
staleness: The maximum staleness allowed for the optimizer.
use_locking: If True use locks for clip update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “DropStaleGradient”.
compute_gradients(loss, *args, **kwargs)[source]

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:
loss: A Tensor containing the value to minimize or a callable taking
no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
var_list: Optional list or tuple of tf.Variable to update to minimize
loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be
GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms.
Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with
the corresponding op.

grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:
A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
Raises:
TypeError: If var_list contains anything else than Variable objects.
ValueError: If some arguments are invalid.
RuntimeError: If called with eager execution enabled and loss is not callable.

@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

get_slot(*args, **kwargs)[source]

Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:
var: A variable passed to minimize() or apply_gradients().
name: A string.
Returns:
The Variable for the slot if it was created, None otherwise.
get_slot_names(*args, **kwargs)[source]

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns:
A list of strings.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.

ElasticAverage

class tensorflow.contrib.opt.python.training.elastic_average_optimizer.ElasticAverageOptimizer(opt, num_worker, ea_custom_getter, communication_period=10, moving_rate=None, rho=None, use_locking=True, synchronous=False, name='ElasticAverageOptimizer')[source]

Wrapper optimizer that implements the Elastic Average SGD algorithm.

This is an async optimizer. During training, each worker updates the local variables and maintains its own local_step, which starts from 0 and is incremented by 1 after each update of local variables. Whenever the communication period divides the local step, the worker requests the current global center variables and then computes the elastic difference between the global center variables and the local variables. The elastic difference is then used to update both the local variables and the global variables.

Construct a new ElasticAverageOptimizer.

Args:
opt: The actual optimizer that will be used to update local variables. Must be one of the Optimizer classes.
num_worker: The number of workers.
ea_custom_getter: The ElasticAverageCustomGetter.
communication_period: An int value that controls the frequency of the communication between every worker and the ps.
moving_rate: A floating point value to control the elastic difference.
rho: the amount of exploration we allow in the model. The default value is moving_rate/learning_rate; rho=0.0 is suggested in async mode.
use_locking: If True use locks for update operations.
synchronous: Whether to add sync queues and a barrier. If True, all workers will wait for each other before starting training. If False, a worker can start training as soon as its initialization is done, with no need to wait until everyone is ready; in case one worker is restarted, it can join and continue training without being blocked.
name: Optional name prefix for the operations created when applying gradients. Defaults to “ElasticAverageOptimizer”.
BETA = 0.9[source]
compute_gradients(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)[source]

Compute gradients of loss for the variables in var_list.

Add rho*elastic_difference to loss to control the exploration. This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:
loss: A Tensor containing the value to minimize.
var_list: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be
GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms.
Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with the
corresponding op.

grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:
A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
Raises:
TypeError: If var_list contains anything else than Variable objects.
ValueError: If some arguments are invalid.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to global variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the variables
have been updated.
name: Optional name for the returned operation. Default to the name
passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
get_init_op(task_index)[source]

Returns the op that sets all the local variables and local center variables equal to the global center variables before the training begins.

make_session_run_hook(is_chief, task_index)[source]

Creates a hook to handle ElasticAverageOptimizerHook ops such as initialization.

swapping_saver(var_list=None, name='swapping_saver', **kwargs)[source]

Create a saver that copies global_center_variable to trainable variables.

Please call this function after all your variables have been created with ElasticAverageCustomGetter. For evaluations or inference, use this saver during training. It will save the global_center_variable of the trained parameters under the original parameter names.

Args:
var_list: List of variables to save, as per Saver(). If set to None, save all the trainable_variables that have been created before this call.
name: The name of the saver.
**kwargs: Keyword arguments of Saver().

Returns:
A tf.compat.v1.train.Saver object.
Raises:
RuntimeError: global_center_variable is empty; please make sure this is called after the model is created and ElasticAverageCustomGetter is used when declaring your model.

Ftrl

class tensorflow.python.training.ftrl.FtrlOptimizer(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl', accum_name=None, linear_name=None, l2_shrinkage_regularization_strength=0.0)[source]

Optimizer that implements the FTRL algorithm.

See this [paper]( https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf). This version has support for both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function).

Construct a new FTRL optimizer.

Args:
learning_rate: A float value or a constant float Tensor.
learning_rate_power: A float value, must be less or equal to zero. Controls how the learning rate decreases during training. Use zero for a fixed learning rate. See section 3.1 in the [paper](https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf).
initial_accumulator_value: The starting value for accumulators. Only zero or positive values are allowed.
l1_regularization_strength: A float value, must be greater than or equal to zero.
l2_regularization_strength: A float value, must be greater than or equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “Ftrl”.
accum_name: The suffix for the variable that keeps the gradient squared accumulator. If not present, defaults to name.
linear_name: The suffix for the variable that keeps the linear gradient accumulator. If not present, defaults to name + “_1”.
l2_shrinkage_regularization_strength: A float value, must be greater than or equal to zero. This differs from L2 above in that the L2 above is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. The FTRL formulation can be written as $$w_{t+1} = \operatorname{argmin}_w \big( \hat{g}_{1:t} \cdot w + L_1 \|w\|_1 + L_2 \|w\|_2^2 \big)$$ where $$\hat{g} = g + 2 L_{2,\text{shrinkage}} w$$ and g is the gradient of the loss function w.r.t. the weights w. Specifically, in the absence of L1 regularization, it is equivalent to the following update rule: $$w_{t+1} = w_t - \frac{lr_t}{1 + 2 L_2 lr_t} g_t - \frac{2 L_{2,\text{shrinkage}} lr_t}{1 + 2 L_2 lr_t} w_t$$ where lr_t is the learning rate at t. When input is sparse, shrinkage will only happen on the active weights.

Raises:
ValueError: If one of the arguments is invalid.

GGT

class tensorflow.contrib.opt.python.training.ggt.GGTOptimizer(learning_rate=0.001, beta1=0.9, use_locking=False, name='GGT', window=10, eps=0.0001, svd_eps=1e-06, sigma_eps=0.01)[source]

Optimizer that implements the GGT algorithm.

GGT has an advantage over SGD and Adam on large models with poor conditioning, for example language models and CNNs; see [[ABCHSZZ 2018]](https://arxiv.org/pdf/1806.02958.pdf).

Construct a new GGT optimizer.

Initialization:

```
t <- 0 (Initialize timestep)
grad_buffer <- 0 (Initialize buffer for keeping past gradients)
flat_grad <- 0 (Initialize flattened gradient that contains gradients of all variables)
m_0 <- 0 (Initialize 1st moment vector)
```

Suppose all variables and their gradients are concatenated into vectors flat_vars and flat_grad. The update rule for flat_vars uses an optimization described at the beginning of section 2 of the paper:

```
t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * flat_grad
grad_buffer[(t-1) % window, :] <- m_t

M <- grad_buffer^T / sqrt(min(t, window))
U, sigma, _ <- SVD(M^T M + I * svd_eps)

sigma_sqrt_inv <- (sqrt(sigma) + sigma_eps)^(-3)
sigma_sqrt_min <- min(sqrt(sigma))

if sigma_sqrt_min > eps:
  new_step <- M U diag(sigma_sqrt_inv) U^T M^T m_t +
              (m_t - M U diag(1/sigma) U^T M^T m_t) / sigma_sqrt_min
else:
  new_step <- M U diag(sigma_sqrt_inv) U^T M^T m_t

flat_vars <- flat_vars - learning_rate * new_step
```

GGT provides the power of full-matrix adaptive regularization at a cost not much larger than SGD. As a result it is suited for large models where the gradient covariance matrix has a poor condition number that slows down first order methods. GGT uses the preconditioner from full-matrix AdaGrad, with gradient history attenuated exponentially as in Adam, and truncated to a window parameter. It has provable guarantees even for non-convex optimization that is never significantly worse than SGD and in some cases better.

Args:
learning_rate: A float hyperparameter. The learning rate.
beta1: A float hyperparameter. The exponential decay rate for the 1st moment estimates.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “GGT”.
window: An integer hyperparameter. The number of first moments to keep in computing the adaptive preconditioner.
eps: A float hyperparameter. Used to truncate small eigenvalues of the gradient covariance matrix.
svd_eps: A float hyperparameter. Used to stabilize SVD.
sigma_eps: A float hyperparameter. Used to regularize matrix inversion.

GradientDescent

class tensorflow.python.training.gradient_descent.GradientDescentOptimizer(learning_rate, use_locking=False, name='GradientDescent')[source]

Optimizer that implements the gradient descent algorithm.

Construct a new gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “GradientDescent”.

@compatibility(eager) When eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

GradVarianceScaled

class returnn.tf.updater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Let m be the running average of g. Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. Same beta1 default as in Adam and in the paper: beta1=0.9.

Let v be the running average of the variance of g, i.e. of (g - m)^2.

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of variance of g (v)
  • epsilon (float) –
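The two running statistics can be written down directly; how they are combined into the final update is not documented here, so this NumPy sketch (illustration only, not the RETURNN implementation) shows just the accumulators:

```python
import numpy as np

def grad_variance_stats(g, m, v, beta1=0.9, beta2=0.999):
    """Running mean of g and running variance of g, as described above (sketch only)."""
    m = beta1 * m + (1 - beta1) * g               # running average of g
    v = beta2 * v + (1 - beta2) * (g - m) ** 2    # running average of (g - m)^2
    return m, v
```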

Keras

class returnn.tf.updater.KerasOptimizer(optimizer, name=None)[source]

Wraps a Keras optimizer into a standard TF optimizer.

Parameters:
  • optimizer (tf.keras.optimizers.Optimizer) –
  • name (str|None) –
classmethod get_factory(keras_class)[source]
Parameters:keras_class (type[T]) – e.g. tf.keras.optimizers.Nadam

Returns: function (kwargs) -> Optimizer
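Based on the signature above, usage presumably looks like the following (untested sketch; the keyword arguments accepted by the returned factory are an assumption):

```python
import tensorflow as tf
from returnn.tf.updater import KerasOptimizer

# get_factory() returns a function kwargs -> Optimizer, per the docstring above.
factory = KerasOptimizer.get_factory(tf.keras.optimizers.Nadam)
opt = factory(learning_rate=0.001)  # wraps the Keras optimizer as a standard TF optimizer
```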

LARS

class tensorflow.contrib.opt.python.training.lars_optimizer.LARSOptimizer(learning_rate, momentum=0.9, weight_decay=0.0001, eeta=0.001, epsilon=0.0, name='LARSOptimizer', skip_list=None, use_nesterov=False)[source]

Layer-wise Adaptive Rate Scaling for large batch training.

Introduced by “Large Batch Training of Convolutional Networks” by Y. You, I. Gitman, and B. Ginsburg. (https://arxiv.org/abs/1708.03888)

Implements the LARS learning rate scheme presented in the paper above. This optimizer is useful when scaling the batch size to up to 32K without significant performance degradation. It is recommended to use the optimizer in conjunction with:

  • Gradual learning rate warm-up
  • Linear learning rate scaling
  • Poly rule learning rate decay

Note, LARS scaling is currently only enabled for dense tensors. Sparse tensors use the default momentum optimizer.

Construct a new LARS Optimizer.

Args:
learning_rate: A Tensor or floating point value. The base learning rate.
momentum: A floating point value. Momentum hyperparameter.
weight_decay: A floating point value. Weight decay hyperparameter.
eeta: LARS coefficient as used in the paper. Default set to the LARS coefficient from the paper. (eeta / weight_decay) determines the highest scaling factor in LARS.
epsilon: Optional epsilon parameter to be set in models that have very small gradients. Default set to 0.0.
name: Optional name prefix for variables and ops created by LARSOptimizer.
skip_list: List of strings to enable skipping variables from LARS scaling. If any of the strings in skip_list is a subset of var.name, variable ‘var’ is skipped from LARS scaling. For a typical classification model with batch normalization, the skip_list is [‘batch_normalization’, ‘bias’].
use_nesterov: when set to True, Nesterov momentum will be enabled.

Raises:
ValueError: If a hyperparameter is set to a non-sensical value.
compute_lr(grad, var)[source]

LazyAdam

class tensorflow.contrib.opt.python.training.lazy_adam_optimizer.LazyAdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Variant of the Adam optimizer that handles sparse updates more efficiently.

The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \text{ (Initialize initial 1st moment vector)}$$
$$v_0 := 0 \text{ (Initialize initial 2nd moment vector)}$$
$$t := 0 \text{ (Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

LazyAdamGS

class tensorflow.contrib.opt.python.training.lazy_adam_gs_optimizer.LazyAdamGSOptimizer(global_step=0, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Variant of the Adam optimizer that handles sparse updates more efficiently.

Branched from tf.contrib.opt.LazyAdamOptimizer. The only difference is to pass the global step for computing the beta1 and beta2 accumulators, instead of having the optimizer keep its own independent beta1 and beta2 accumulators as non-slot variables.

The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.

Construct a new Adam optimizer.

Branched from tf.train.AdamOptimizer. The only difference is to pass global step for computing beta1 and beta2 accumulators, instead of having optimizer keep its own independent beta1 and beta2 accumulators as non-slot variables.

Initialization:

$$m_0 := 0 \text{ (Initialize initial 1st moment vector)}$$
$$v_0 := 0 \text{ (Initialize initial 2nd moment vector)}$$
$$t := 0 \text{ (Initialize timestep)}$$

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).

Args:
global_step: tensorflow variable indicating the step.
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

ModelAverage

class tensorflow.contrib.opt.python.training.model_average_optimizer.ModelAverageOptimizer(opt, num_worker, is_chief, ma_custom_getter, interval_steps=100, use_locking=True, name='ModelAverageOptimizer')[source]

Wrapper optimizer that implements the Model Average algorithm.

This is a sync optimizer. During the training, each worker will update the local variables and maintains its own local_step, which starts from 0 and is incremented by 1 after each update of local variables. Whenever the interval_steps divides the local step, the local variables from all the workers will be averaged and assigned to global center variables. Then the local variables will be assigned by global center variables.

Construct a new model average optimizer.

Args:
opt: The actual optimizer that will be used to update local variables.
num_worker: The number of workers.
is_chief: whether this is the chief worker.
ma_custom_getter: ModelAverageCustomGetter.
interval_steps: An int value that controls the frequency of the averaging of local variables.
use_locking: If True use locks for update operations.
name: string. Optional name of the returned operation.

compute_gradients(*args, **kwargs)[source]

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the compute_gradients() from the real optimizer.

Args:
*args: Arguments for compute_gradients().
**kwargs: Keyword arguments for compute_gradients().
Returns:
A list of (gradient, variable) pairs.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This contains most of the synchronization implementation and also wraps the apply_gradients() from the real optimizer. The chief worker updates global variables.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the variables
have been updated.
name: Optional name for the returned operation. Default to the name
passed to the Optimizer constructor.
Returns:
A conditional Operation that updates either both local and global variables or just the local variables.
Raises:

ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be checked.
get_init_op()[source]

Returns the op.

This method sets all the local variables equal to the global variables before the training begins.

make_session_run_hook()[source]

Creates a hook to handle ModelAverage ops such as initialization.

Momentum

class tensorflow.python.training.momentum.MomentumOptimizer(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)[source]

Optimizer that implements the Momentum algorithm.

Computes (if use_nesterov = False):

```
accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation
```

Note that in the dense version of this algorithm, accumulation is updated and applied regardless of a gradient’s value, whereas the sparse version (when the gradient is an IndexedSlices, typically because of tf.gather or an embedding) only updates variable slices and corresponding accumulation terms when that part of the variable was used in the forward pass.
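A NumPy sketch of the dense update, including the Nesterov variant described in the args below (illustration only):

```python
import numpy as np

def momentum_step(var, g, accumulation, learning_rate=0.01, momentum=0.9, use_nesterov=False):
    """One dense (Nesterov) momentum step, mirroring the description above (illustration only)."""
    accumulation = momentum * accumulation + g
    if use_nesterov:
        # NAG approximation: step with the gradient plus the momentum-scaled accumulator.
        var = var - learning_rate * (g + momentum * accumulation)
    else:
        var = var - learning_rate * accumulation
    return var, accumulation
```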

Construct a new Momentum optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “Momentum”.
use_nesterov: If True use Nesterov Momentum. See [Sutskever et al., 2013](http://jmlr.org/proceedings/papers/v28/sutskever13.pdf). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper. This implementation is an approximation of the original formula, valid for high values of momentum. It will compute the “adjusted gradient” in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient.

@compatibility(eager) When eager execution is enabled, learning_rate and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

MomentumW

class tensorflow.contrib.opt.python.training.weight_decay_optimizers.MomentumWOptimizer(weight_decay, learning_rate, momentum, use_locking=False, name='MomentumW', use_nesterov=False)[source]

Optimizer that implements the Momentum algorithm with weight_decay.

This is an implementation of the SGDW optimizer described in [“Fixing Weight Decay Regularization in Adam” by Loshchilov & Hutter](https://arxiv.org/abs/1711.05101) ([pdf](https://arxiv.org/pdf/1711.05101.pdf)). It computes the update step of train.MomentumOptimizer and additionally decays the variable. Note that this is different from adding L2 regularization on the variables to the loss. Decoupling the weight decay from other hyperparameters (in particular the learning rate) simplifies hyperparameter search.

For further information see the documentation of the Momentum Optimizer.

Note that this optimizer can also be instantiated as `extend_with_weight_decay(tf.compat.v1.train.MomentumOptimizer, weight_decay=weight_decay)`.

Construct a new MomentumW optimizer.

For further information see the documentation of the Momentum Optimizer.

Args:
weight_decay: A Tensor or a floating point value. The weight decay.
learning_rate: A Tensor or a floating point value. The learning rate.
momentum: A Tensor or a floating point value. The momentum.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “Momentum”.
use_nesterov: If True use Nesterov Momentum. See [Sutskever et al., 2013](http://jmlr.org/proceedings/papers/v28/sutskever13.pdf). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.

@compatibility(eager) When eager execution is enabled, learning_rate, weight_decay and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

MovingAverage

class tensorflow.contrib.opt.python.training.moving_average_optimizer.MovingAverageOptimizer(opt, average_decay=0.9999, num_updates=None, sequential_update=True)[source]

Optimizer that computes a moving average of the variables.

Empirically it has been found that using the moving average of the trained parameters of a deep network is better than using its trained parameters directly. This optimizer allows you to compute this moving average and swap the variables at save time so that any code outside of the training loop will use by default the averaged values instead of the original ones.

Example of usage:

```python
# Encapsulate your favorite optimizer (here the momentum one)
# inside the MovingAverageOptimizer.
opt = tf.compat.v1.train.MomentumOptimizer(learning_rate, FLAGS.momentum)
opt = tf.contrib.opt.MovingAverageOptimizer(opt)
# Then create your model and all its variables.
model = build_model()
# Add the training op that optimizes using opt.
# This needs to be called before swapping_saver().
opt.minimize(cost, var_list)
# Then create your saver like this:
saver = opt.swapping_saver()
# Pass it to your training loop.
slim.learning.train(
    model, …
    saver=saver)
```

Note that for evaluation, the normal saver should be used instead of swapping_saver().

Construct a new MovingAverageOptimizer.

Args:
opt: A tf.Optimizer that will be used to compute and apply gradients.
average_decay: Float. Decay to use to maintain the moving averages of trained variables. See tf.train.ExponentialMovingAverage for details.
num_updates: Optional count of number of updates applied to variables. See tf.train.ExponentialMovingAverage for details.
sequential_update: Bool. If False, will compute the moving average at the same time as the model is updated, potentially doing benign data races. If True, will update the moving average after gradient updates.
compute_gradients(*args, **kwargs)[source]

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:
loss: A Tensor containing the value to minimize or a callable taking
no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
var_list: Optional list or tuple of tf.Variable to update to minimize
loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be
GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms.
Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with
the corresponding op.

grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:
A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
Raises:

TypeError: If var_list contains anything else than Variable objects.
ValueError: If some arguments are invalid.
RuntimeError: If called with eager execution enabled and loss is not callable.

@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.
swapping_saver(var_list=None, name='swapping_saver', **kwargs)[source]

Create a saver swapping moving averages and variables.

You should use this saver during training. It will save the moving averages of the trained parameters under the original parameter names. For evaluations or inference you should use a regular saver and it will automatically use the moving averages for the trained variable.

You must call this function after all variables have been created and after you have called Optimizer.minimize().

Args:
var_list: List of variables to save, as per Saver(). If set to None, will save all the variables that have been created before this call.
name: The name of the saver.
**kwargs: Keyword arguments of Saver().

Returns:
A tf.compat.v1.train.Saver object.
Raises:

RuntimeError: If apply_gradients or minimize has not been called before.
ValueError: If var_list is provided and contains some variables but not their moving average counterpart.

Nadam

class tensorflow.contrib.opt.python.training.nadam_optimizer.NadamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]

Optimizer that implements the Nadam algorithm.

See [Dozat, T., 2015](http://cs229.stanford.edu/proj2015/054_report.pdf).

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \quad \text{(Initialize 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$

The update rule for a variable with gradient g uses an optimization described at the end of section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g$$
$$v_t := \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g^2$$
$$\text{variable} := \text{variable} - lr_t \cdot m_t / (\sqrt{v_t} + \epsilon)$$

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
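
For reference, a minimal NumPy sketch of the dense update rule written out above (the inherited Adam formulation; the Nadam-specific momentum schedule from Dozat's paper is not shown), with illustrative names:

```python
import numpy as np

def adam_style_step(var, g, m, v, t, learning_rate=0.001,
                    beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One dense update step following the equations above."""
    t += 1
    lr_t = learning_rate * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g          # 1st moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # 2nd moment estimate
    var = var - lr_t * m / (np.sqrt(v) + epsilon)
    return var, m, v, t
```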

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.

@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility

NeuralOptimizer1

class returnn.tf.updater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)[source]

Via Neural Optimizer Search with Reinforcement Learning (http://proceedings.mlr.press/v70/bello17a/bello17a.pdf).

As an approximation of the optimizer update g * exp(sign(g) * sign(m)) from the paper, this implementation uses:

g * where(sign(g) == sign(m), 1.0, decrease_factor)

where m is the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. The default beta1=0.9 is the same as in Adam and in the paper. A NumPy sketch of this update follows the parameter list below.

Parameters:
  • beta1 (float) – used for the running average of m
  • decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
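
A minimal NumPy sketch of this update rule (the learning-rate handling here is an assumption, since it is inherited via **kwargs and not shown in the signature):

```python
import numpy as np

def neural_optimizer1_step(var, g, m, learning_rate=0.001,
                           beta1=0.9, decrease_factor=0.1):
    """One step of the update rule described above; all names are illustrative."""
    m = beta1 * m + (1.0 - beta1) * g   # running average of the gradient
    scale = np.where(np.sign(g) == np.sign(m), 1.0, decrease_factor)
    return var - learning_rate * scale * g, m
```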

NormalizedSGD

class returnn.tf.updater.NormalizedSGD(learning_rate, use_locking=False, name=None)[source]

All grads are L2 normalized (via tf.nn.l2_normalize()), otherwise it’s standard SGD. Via: https://github.com/kmkolasinski/deep-learning-notes/tree/master/max-normed-optimizer

Construct a new optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
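
A sketch of the idea only (not the actual code of this class): L2-normalize each gradient before a plain SGD update; `loss` is a placeholder:

```python
import tensorflow as tf

# Sketch only: normalize each gradient to unit L2 norm, then apply plain SGD.
opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss)   # `loss` is assumed to exist
normalized = [(g / (tf.norm(g) + 1e-12), v)    # approximates tf.nn.l2_normalize over the whole tensor
              for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(normalized)
```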

PowerSign

class tensorflow.contrib.opt.python.training.powersign.PowerSignOptimizer(learning_rate=0.1, base=2.718281828459045, beta=0.9, sign_decay_fn=None, use_locking=False, name='PowerSignOptimizer')[source]

Optimizer that implements the PowerSign update.

See [Bello et al., ICML2017], [Neural Optimizer Search with RL](https://arxiv.org/abs/1709.07417).

Constructs a new PowerSignOptimizer object.

Initialization:

```
m_0 <- 0 (Initialize initial 1st moment vector)
t <- 0 (Initialize timestep)
```

Update:

```
t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
sign_decay <- sign_decay_fn(t)
update <- base ** (sign_decay * sign(g) * sign(m)) * g
variable <- variable - lr_t * update
```

Example usage for PowerSign-cd (PowerSign with cosine sign decay):

```python
decay_steps = 1000
linear_decay_fn = sign_decays.get_cosine_decay_fn(decay_steps)
opt = PowerSignOptimizer(learning_rate=0.1, sign_decay_fn=linear_decay_fn)
```

Args:
learning_rate: learning rate used when taking a step.
base: base used in the optimizer.
beta: decay used for computing the moving average m.
sign_decay_fn: decay function applied to the sign(g) * sign(m) quantity. Takes global_step as an argument. See sign_decay.py for some examples.
use_locking: If True, use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “PowerSignOptimizer”.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.

ProximalAdagrad

class tensorflow.python.training.proximal_adagrad.ProximalAdagradOptimizer(learning_rate, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalAdagrad')[source]

Optimizer that implements the Proximal Adagrad algorithm.

See this [paper](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf).

Construct a new ProximalAdagrad optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
l1_regularization_strength: A float value, must be greater than or equal to zero.
l2_regularization_strength: A float value, must be greater than or equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “ProximalAdagrad”.
Raises:
ValueError: If the initial_accumulator_value is invalid.
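
A minimal usage sketch, assuming a loss tensor already exists; the regularization strengths shown are arbitrary example values:

```python
import tensorflow as tf

# Sketch only: Proximal Adagrad with L1 and L2 regularization applied as part
# of the update; `loss` is assumed to be defined elsewhere.
opt = tf.compat.v1.train.ProximalAdagradOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.001,
    l2_regularization_strength=0.001)
train_op = opt.minimize(loss)
```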

ProximalGradientDescent

class tensorflow.python.training.proximal_gradient_descent.ProximalGradientDescentOptimizer(learning_rate, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalGradientDescent')[source]

Optimizer that implements the proximal gradient descent algorithm.

See this [paper](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf).

Construct a new proximal gradient descent optimizer.

Args:
learning_rate: A Tensor or a floating point value. The learning
rate to use.
l1_regularization_strength: A float value, must be greater than or
equal to zero.
l2_regularization_strength: A float value, must be greater than or
equal to zero.

use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to “ProximalGradientDescent”.

RegAdagrad

class tensorflow.contrib.opt.python.training.reg_adagrad_optimizer.RegAdagradOptimizer(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='RegAdagrad')[source]

RegAdagrad: Adagrad with updates that optionally skip updating the slots.

This is meant to address the problem of additional regularization terms in the loss function affecting learning rate decay and causing hyper-param entanglement. Example usage:

```python
loss = tf.nn.cross_entropy(x, labels)
reg_loss = reg_strength * tf.reduce_sum(x * x)
opt = tf.contrib.opt.RegAdagradOptimizer(learning_rate)
loss_update = opt.minimize(loss)
with opt.avoid_updating_slots():
    reg_update = opt.minimize(reg_loss)
total_update = tf.group([loss_update, reg_update])

# ...

sess.run(total_update, ...)
```

avoid_updating_slots()[source]

RMSProp

class tensorflow.python.training.rmsprop.RMSPropOptimizer(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, name='RMSProp')[source]

Optimizer that implements the RMSProp algorithm.

See the [paper](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf).

Construct a new RMSProp optimizer.

Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an “eventual” correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm.

Args:
learning_rate: A Tensor or a floating point value. The learning rate.
decay: Discounting factor for the history/coming gradient.
momentum: A scalar tensor.
epsilon: Small value to avoid zero denominator.
use_locking: If True use locks for update operation.
centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
name: Optional name prefix for the operations created when applying gradients. Defaults to “RMSProp”.

@compatibility(eager) When eager execution is enabled, learning_rate, decay, momentum, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
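
A minimal usage sketch, assuming a loss tensor already exists; the hyper-parameter values are arbitrary examples:

```python
import tensorflow as tf

# Sketch only: centered=True normalizes by the estimated variance of the
# gradient instead of the uncentered second moment (see above).
opt = tf.compat.v1.train.RMSPropOptimizer(
    learning_rate=0.001, decay=0.9, momentum=0.9, epsilon=1e-10, centered=True)
train_op = opt.minimize(loss)   # `loss` is assumed to be defined elsewhere
```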

Shampoo

class tensorflow.contrib.opt.python.training.shampoo.ShampooOptimizer(global_step=0, max_matrix_size=768, gbar_decay=0.0, gbar_weight=1.0, mat_gbar_decay=1.0, mat_gbar_weight=1.0, learning_rate=1.0, svd_interval=1, precond_update_interval=1, epsilon=0.0001, alpha=0.5, use_iterative_root=False, use_locking=False, name='Shampoo')[source]

The Shampoo Optimizer

Variant of Adagrad using one preconditioner matrix per variable dimension. For details, see https://arxiv.org/abs/1802.09568

gbar is time-weighted accumulated gradient: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]

mat_gbar is the time-weighted accumulated gradient square:
mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
where, if g[t] = g_abcd, then gg_a[t] = g_abcd g_a'bcd (Einstein notation)

Update rule:
w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]

Here mat_gbar_j[t]^(-alpha/n) gbar[t] is a tensor contraction along the j'th dimension of gbar[t] with the first dimension of mat_gbar_j[t]^(-alpha/n), where alpha is a hyperparameter and n is the rank of the variable. Prod_j represents doing this contraction for all j in 0..n-1.

Typically learning_rate is constant, but could be time dependent by passing a lambda function that depends on step.

Default values of the various hyper-parameters are given in the class signature above.

gbar_decay, gbar_weight etc. can be a float or a time-varying parameter. For time-varying parameters use e.g. "lambda T: T / (T + 1.0)", where the expression in the lambda is a TensorFlow expression.

Args:
global_step: a TensorFlow variable indicating the step.
max_matrix_size: We do not perform SVD for matrices larger than this.
gbar_decay:
gbar_weight: Used to update gbar: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
mat_gbar_decay:
mat_gbar_weight: Used to update mat_gbar: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
learning_rate: Similar to SGD.
svd_interval: We should do SVD after this many steps. Default = 1, i.e. every step. Usually 20 leads to no loss of accuracy, and 50 or 100 is also OK. You may also want it more often early in training and less often later; set it in the caller as, for example: "svd_interval = lambda T: tf.cond(T < 2000, lambda: 20.0, lambda: 1000.0)"
precond_update_interval: We should update the preconditioners after this many steps. Default = 1. Usually less than svd_interval.
epsilon: epsilon * I_n is added to each mat_gbar_j for stability in the non-diagonal version of Shampoo.
alpha: total power of the preconditioners.
use_iterative_root: should the optimizer use SVD (faster) or the iterative root method (for TPU) for finding the roots of PSD matrices.
use_locking: If True use locks for update operations.
name: name of the optimizer.
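
A minimal sketch of passing time-varying hyper-parameters as lambdas, as described above; `global_step` and `loss` are assumed to exist in the surrounding graph, and the contrib import path is taken from the class signature above:

```python
import tensorflow as tf
from tensorflow.contrib.opt.python.training.shampoo import ShampooOptimizer  # TF 1.x path

# Sketch only: decay and SVD interval given as functions of the step.
opt = ShampooOptimizer(
    global_step,
    learning_rate=1.0,
    gbar_decay=lambda t: t / (t + 1.0),  # time-varying decay of the gradient average
    svd_interval=lambda t: tf.cond(t < 2000, lambda: 20.0, lambda: 1000.0))
train_op = opt.minimize(loss, global_step=global_step)
```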

SyncReplicas

class tensorflow.python.training.sync_replicas_optimizer.SyncReplicasOptimizer(opt, replicas_to_aggregate, total_num_replicas=None, variable_averages=None, variables_to_average=None, use_locking=False, name='sync_replicas')[source]

Class to synchronize, aggregate gradients and pass them to the optimizer.

This class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

In a typical asynchronous training environment, it’s common to have some stale gradients. For example, with an N-replica asynchronous training setup, gradients will be applied to the variables N times independently. Depending on each replica’s training speed, some gradients might be calculated from copies of the variables that are several steps old (N-1 steps on average). This optimizer avoids stale gradients by collecting gradients from all replicas, averaging them, and then applying them to the variables in one shot, after which the replicas can fetch the new variables and continue.

The following accumulators/queue are created:

  • N gradient accumulators, one per variable to train. Gradients are pushed to them and the chief worker will wait until enough gradients are collected and then average them before applying to variables. The accumulator will drop all stale gradients (more details in the accumulator op).
  • 1 token queue where the optimizer pushes the new global_step value after all variables are updated.

The following local variable is created:

  • sync_rep_local_step, one per replica. Compared against the global_step in each accumulator to check for staleness of the gradients.

The optimizer adds nodes to the graph to collect gradients and pause the trainers until variables are updated. For the Parameter Server job:

  1. An accumulator is created for each variable, and each replica pushes the gradients into the accumulators instead of directly applying them to the variables.
  2. Each accumulator averages once enough gradients (replicas_to_aggregate) have been accumulated.
  3. Apply the averaged gradients to the variables.
  4. Only after all variables have been updated, increment the global step.
  5. Only after step 4, pushes global_step in the token_queue, once for each worker replica. The workers can now fetch the global step, use it to update its local_step variable and start the next batch. Please note that some workers can consume multiple minibatches, while some may not consume even one. This is because each worker fetches minibatches as long as a token exists. If one worker is stuck for some reason and does not consume a token, another worker can use it.

For the replicas:

  1. Start a step: fetch variables and compute gradients.
  2. Once the gradients have been computed, push them into gradient accumulators. Each accumulator will check the staleness and drop the stale.
  3. After pushing all the gradients, dequeue an updated value of global_step from the token queue and record that step to its local_step variable. Note that this is effectively a barrier.
  4. Start the next batch.

### Usage

```python
# Create any optimizer to update the variables, say a simple SGD:
opt = GradientDescentOptimizer(learning_rate=0.1)

# Wrap the optimizer with sync_replicas_optimizer with 50 replicas: at each
# step the optimizer collects 50 gradients before applying to variables.
# Note that if you want to have 2 backup replicas, you can change
# total_num_replicas=52 and make sure this number matches how many physical
# replicas you started in your job.
opt = tf.compat.v1.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=50,
                                               total_num_replicas=50)

# Some models have startup_delays to help stabilize the model but when using
# sync_replicas training, set it to 0.

# Now you can call minimize() or compute_gradients() and apply_gradients()
# normally.
training_op = opt.minimize(total_loss, global_step=self.global_step)

# You can create the hook which handles initialization and queues.
sync_replicas_hook = opt.make_session_run_hook(is_chief)
```

In the training program, every worker will run the train_op as if not synchronized.

```python
with training.MonitoredTrainingSession(
    master=workers[worker_id].target, is_chief=is_chief,
    hooks=[sync_replicas_hook]) as mon_sess:
  while not mon_sess.should_stop():
    mon_sess.run(training_op)
```

To use SyncReplicasOptimizer with an Estimator, you need to pass sync_replicas_hook when calling fit:

```python
my_estimator = DNNClassifier(..., optimizer=opt)
my_estimator.fit(..., hooks=[sync_replicas_hook])
```

Construct a sync_replicas optimizer. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: The SyncReplicaOptimizer class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).

Args:
opt: The actual optimizer that will be used to compute and apply the
gradients. Must be one of the Optimizer classes.
replicas_to_aggregate: number of replicas to aggregate for each variable
update.
total_num_replicas: Total number of tasks/workers/replicas, could be
different from replicas_to_aggregate. If total_num_replicas > replicas_to_aggregate: it is backup_replicas + replicas_to_aggregate. If total_num_replicas < replicas_to_aggregate: Replicas compute multiple batches per update to variables.
variable_averages: Optional ExponentialMovingAverage object, used to
maintain moving averages for the variables passed in variables_to_average.
variables_to_average: a list of variables that need to be averaged. Only
needed if variable_averages is passed in.

use_locking: If True use locks for update operation.
name: string. Optional name of the returned operation.

compute_gradients(*args, **kwargs)[source]

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in the apply_gradients() so that user can modify the gradients like clipping with per replica global norm if needed. The global norm with aggregated gradients can be bad as one replica’s huge gradients can hurt the gradients from other replicas.

Args:
*args: Arguments for compute_gradients().
**kwargs: Keyword arguments for compute_gradients().
Returns:
A list of (gradient, variable) pairs.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This contains most of the synchronization implementation and also wraps the apply_gradients() from the real optimizer.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
train_op: The op to dequeue a token so the replicas can exit this batch and start the next one. This is executed by each replica.
Raises:

ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be checked.
get_chief_queue_runner()[source]

Returns the QueueRunner for the chief to execute.

This includes the operations to synchronize replicas: aggregate gradients, apply to variables, increment global step, insert tokens to token queue.

Note that this can only be called after calling apply_gradients() which actually generates this queuerunner.

Returns:
A QueueRunner for chief to execute.
Raises:
ValueError: If this is called before apply_gradients().
get_slot(*args, **kwargs)[source]

Return a slot named “name” created for “var” by the Optimizer.

This simply wraps the get_slot() from the actual optimizer.

Args:
*args: Arguments for get_slot().
**kwargs: Keyword arguments for get_slot().
Returns:
The Variable for the slot if it was created, None otherwise.
variables()[source]

Fetches a list of optimizer variables in the default graph.

This wraps variables() from the actual optimizer. It does not include the SyncReplicasOptimizer’s local step.

Returns:
A list of variables.
get_slot_names(*args, **kwargs)[source]

Return a list of the names of slots created by the Optimizer.

This simply wraps the get_slot_names() from the actual optimizer.

Args:
*args: Arguments for get_slot_names().
**kwargs: Keyword arguments for get_slot_names().
Returns:
A list of strings.
get_init_tokens_op(num_tokens=-1)[source]

Returns the op to fill the sync_token_queue with the tokens.

This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure: num_tokens >= replicas_to_aggregate - total_num_replicas.

Args:
num_tokens: Number of tokens to add to the queue.
Returns:
An op for the chief/sync replica to fill the token queue.
Raises:

ValueError: If this is called before apply_gradients(). ValueError: If num_tokens are smaller than replicas_to_aggregate -

total_num_replicas.
make_session_run_hook(is_chief, num_tokens=-1)[source]

Creates a hook to handle SyncReplicasHook ops such as initialization.

VariableClipping

class tensorflow.contrib.opt.python.training.variable_clipping_optimizer.VariableClippingOptimizer(opt, vars_to_clip_dims, max_norm, use_locking=False, colocate_clip_ops_with_vars=False, name='VariableClipping')[source]

Wrapper optimizer that clips the norm of specified variables after update.

This optimizer delegates all aspects of gradient calculation and application to an underlying optimizer. After applying gradients, this optimizer then clips the variable to have a maximum L2 norm along specified dimensions. NB: this is quite different from clipping the norm of the gradients.

Multiple instances of VariableClippingOptimizer may be chained to specify different max norms for different subsets of variables.

This is more efficient at serving-time than using normalization during embedding lookup, at the expense of more expensive training and fewer guarantees about the norms.


Construct a new clip-norm optimizer.

Args:
opt: The actual optimizer that will be used to compute and apply the gradients. Must be one of the Optimizer classes.
vars_to_clip_dims: A dict with keys as Variables and values as lists of dimensions along which to compute the L2-norm. See tf.clip_by_norm for more details.
max_norm: The L2-norm to clip to, for all variables specified.
use_locking: If True use locks for clip update operations.
colocate_clip_ops_with_vars: If True, try colocating the clip norm ops with the corresponding variable.
name: Optional name prefix for the operations created when applying gradients. Defaults to “VariableClipping”.
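
A minimal sketch of wrapping a base optimizer so that a hypothetical embedding variable is clipped row-wise to a maximum L2 norm after each update; `embeddings` and `loss` are placeholders, and the contrib import path is taken from the class signature above:

```python
import tensorflow as tf
from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import (
    VariableClippingOptimizer,  # TF 1.x contrib path
)

# Sketch only: clip each row of `embeddings` (norm taken along axis 1) to 1.0
# after the wrapped Adagrad update; `embeddings` and `loss` are assumed to exist.
base_opt = tf.compat.v1.train.AdagradOptimizer(learning_rate=0.1)
clip_opt = VariableClippingOptimizer(
    base_opt,
    vars_to_clip_dims={embeddings: [1]},
    max_norm=1.0)
train_op = clip_opt.minimize(loss)
```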
compute_gradients(*args, **kwargs)[source]

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:
loss: A Tensor containing the value to minimize or a callable taking
no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
var_list: Optional list or tuple of tf.Variable to update to minimize
loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be
GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms.
Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with
the corresponding op.

grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:
A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
Raises:
TypeError: If var_list contains anything other than Variable objects.
ValueError: If some arguments are invalid.
RuntimeError: If called with eager execution enabled and loss is not callable.

@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

get_slot(*args, **kwargs)[source]

Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:
var: A variable passed to minimize() or apply_gradients().
name: A string.
Returns:
The Variable for the slot if it was created, None otherwise.
get_slot_names(*args, **kwargs)[source]

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns:
A list of strings.
apply_gradients(grads_and_vars, global_step=None, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.