Parameters:class_name (str) – e.g. “adam”
Returns:the class
Return type:type[Optimizer]|()->Optimizer
class TFUpdater.Updater(config, tf_session, network, initial_learning_rate=1.0)[source]

This will create the tf.train.Optimizer instance given the config and the update-op for all trainable vars. See the code of Updater.create_optimizer() for valid config options.

Note: Vincent Vanhoucke says that if you get NaNs, you should consider increasing the epsilon (for Adam, Nadam and similar). This is the config option optimizer_epsilon. In some places in our Theano code the default epsilon is 1e-16, in others it is 1e-8. 1e-8 might be more stable, or even 1e-6. Note that when the gradient is suddenly zero in one step, the update can be proportional to lr / eps.
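
As a rough illustration of the lr / eps remark, here is a minimal sketch of a single Adam-style step in plain Python. It assumes the textbook formulation lr * m_hat / (sqrt(v_hat) + eps) from Algorithm 1 of the Kingma and Ba paper, which is not necessarily the exact formulation TensorFlow uses. The function name adam_step and all input values are hypothetical and not part of TFUpdater.

```python
import math


def adam_step(m_hat, v_hat, lr=0.001, eps=1e-8):
    """Size of one Adam-style update (textbook formulation, an assumption here).

    m_hat: bias-corrected running average of the gradient
    v_hat: bias-corrected running average of the squared gradient
    """
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# When sqrt(v_hat) is negligible next to eps, the step reduces to
# lr * m_hat / eps, so a smaller eps gives a proportionally larger step.
tiny_eps_step = adam_step(m_hat=1e-10, v_hat=0.0, lr=1.0, eps=1e-16)
larger_eps_step = adam_step(m_hat=1e-10, v_hat=0.0, lr=1.0, eps=1e-8)
```

With these made-up values, raising eps from 1e-16 to 1e-8 shrinks the step by eight orders of magnitude.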

From the tf.train.AdamOptimizer documentation:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

More from Vincent Vanhoucke:

One thing you can do is run with a tiny learning rate, or even zero learning rate. If you still have divergence then, you have a bug in your setup. If not, increase your rate slowly and see if there is a regime in which things train without diverging. It’s completely possible to have weights that are in a good range, but activations or gradients going to infinity because of the shape of the loss, or too high a learning rate. It’s obviously always a possibility that there is a bug in the optimizers, but in my experience, every single instance of this kind of problem could be traced back to a weirdly wired model, learning rate issues, bad randomization of the input examples, or - in the case of Adam or RMSProp - issues with the epsilon value.

In addition, you might also want to try gradient_nan_inf_filter or maybe set beta1=0.5.

For further debugging, see tf.add_check_numerics_ops() or add_check_numerics_ops_and_debug_print(), which is config option debug_add_check_numerics_ops. Also relevant are config options debug_add_check_numerics_on_output and debug_grad_summaries.
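
The options named in this section could be combined in a config roughly as follows. This is only a sketch: it assumes the config is a Python file in which options are plain top-level assignments, and that the debug options and gradient_nan_inf_filter are boolean flags. The values are illustrative, not recommended defaults.

```python
# Larger epsilon for Adam-like optimizers (the note above suggests 1e-8 or even 1e-6).
optimizer_epsilon = 1e-6

# Assumed to be a boolean switch for filtering NaN/inf gradients.
gradient_nan_inf_filter = True

# Debugging aids mentioned above; assumed to be boolean switches.
debug_add_check_numerics_ops = True
debug_add_check_numerics_on_output = True
debug_grad_summaries = True
```

The numerics checks are likely to slow down training, so it probably makes sense to enable them only while hunting for the source of a NaN.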


Call this if something changes that the optim_op depends on. See self.create_optim_op().

Parameters:trainable_vars (list[tf.Variable]) –
Parameters:value (float) –
Return type:tf.Tensor
Parameters:callback_on_new (None|()->None) –
Return type:tf.Operation
TFUpdater.accum_grad_multiple_step(grad, var, train_step, num_accum_steps)[source]
Parameters:
  • grad (tf.Tensor|tf.IndexedSlices) –
  • var (tf.Variable) –
  • train_step (tf.Tensor) – int, scalar
  • num_accum_steps (int) –
Returns:modified grad
Return type:


class TFUpdater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)[source]

Just an example implementation for simple gradient descent.

Construct a new optimizer.

Parameters:
  • learning_rate – A Tensor or a floating point value. The learning rate to use.
  • use_locking – If True, use locks for update operations.
  • name – Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
class TFUpdater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)[source]

Via Neural Optimizer Search with Reinforcement Learning (

Equivalent to the optimizer g * exp(sign(g) * sign(m)), we use:

g * where(sign(g) == sign(m), 1.0, decrease_factor)

where m is the running average of g.

Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. Same beta1 default as in Adam and in the paper: beta1=0.9.

Parameters:
  • beta1 (float) – used for the running average of m
  • decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
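
The rule above can be sketched for a single scalar gradient in plain Python. This is a minimal illustration, not the TFUpdater implementation: the function names are hypothetical, and it assumes that m is updated before the sign comparison. With decrease_factor = e^-2, the result matches g * exp(sign(g) * sign(m)) up to the constant factor e (for non-zero g and m), which can be absorbed into the learning rate.

```python
import math


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def neural_optimizer1_update(g, m_prev, beta1=0.9, decrease_factor=0.1):
    """Return (scaled gradient, new running average m) for one step.

    Assumption: the running average is updated first and the updated
    value is used for the sign comparison.
    """
    m = beta1 * m_prev + (1.0 - beta1) * g
    scale = 1.0 if sign(g) == sign(m) else decrease_factor
    return g * scale, m


# Gradient agrees with the running average: passed through unchanged.
agree, _ = neural_optimizer1_update(g=1.0, m_prev=2.0)

# Gradient disagrees with the running average: scaled by decrease_factor.
disagree, m_new = neural_optimizer1_update(
    g=-1.0, m_prev=2.0, decrease_factor=math.exp(-2))

# Same step in the exp form, divided by the constant factor e.
exp_form = -1.0 * math.exp(sign(-1.0) * sign(m_new)) / math.e
```
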
class TFUpdater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]

Let m be the running average of g. Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. Same beta1 default as in Adam and in the paper: beta1=0.9.

Let v be the running average of the variance of g, i.e. of (g - m)^2.

Parameters:
  • beta1 (float) – used for the running average of g (m)
  • beta2 (float) – used for the running average of variance of g (v)
  • epsilon (float) –
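
The two running statistics described above can be sketched in plain Python for a scalar gradient. This covers only m and v, not how the optimizer turns them into an update. The function name is hypothetical, and two details are assumptions: that v is updated with beta2 in the same exponential-average form as m, and that the freshly updated m is used in (g - m)^2.

```python
def update_grad_stats(g, m_prev, v_prev, beta1=0.9, beta2=0.999):
    """Return the new running mean m and running variance v of the gradient g."""
    # Running average of the gradient, as given above.
    m = beta1 * m_prev + (1.0 - beta1) * g
    # Running average of the squared deviation (g - m)^2 (assumed form).
    v = beta2 * v_prev + (1.0 - beta2) * (g - m) ** 2
    return m, v


# First step from zero-initialised statistics with a gradient of 2.0.
m, v = update_grad_stats(g=2.0, m_prev=0.0, v_prev=0.0)

# Feeding the same gradient repeatedly drives m towards that gradient.
m_long, v_long = 0.0, 0.0
for _ in range(200):
    m_long, v_long = update_grad_stats(2.0, m_long, v_long)
```
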
class TFUpdater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=[])[source]