returnn.tf.updater
This module covers the optimizer (SGD, Adam, etc) logic, and model param update logic in general.
- returnn.tf.updater.register_optimizer_class(cls, name=None)
- Parameters:
cls (type[Optimizer|KerasOptimizer]) –
name (str|None) –
- returnn.tf.updater.get_optimizer_class(class_name)
- Parameters:
class_name (str|function|type[Optimizer|KerasOptimizer]) – e.g. “adam”
- Returns:
the class
- Return type:
type[Optimizer|KerasOptimizer]
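The two functions above form a name-based registry. The following is a minimal, self-contained sketch of how such a registry might behave; it is not the RETURNN implementation. The dictionary `_registry`, the helper names `register` and `lookup`, the lower-casing of names, and the stand-in class `MyOptimizer` are all assumptions made for illustration.

```python
# Hypothetical sketch of a name-based optimizer registry (not RETURNN code).
from typing import Callable, Dict, Optional, Union

_registry: Dict[str, type] = {}  # assumed storage: lower-cased name -> class


def register(cls: type, name: Optional[str] = None) -> None:
    """Register cls under name, defaulting to the class name."""
    _registry[(name or cls.__name__).lower()] = cls


def lookup(class_name: Union[str, type, Callable[[], type]]) -> type:
    """Resolve a string, a class, or a zero-argument factory to a class."""
    if isinstance(class_name, type):
        return class_name  # already a class, pass it through
    if callable(class_name):
        return class_name()  # assumed: a function that returns the class
    return _registry[class_name.lower()]


class MyOptimizer:  # stand-in for an Optimizer subclass
    pass


register(MyOptimizer, name="my_opt")
```

With this sketch, `lookup("my_opt")`, `lookup(MyOptimizer)` and `lookup(lambda: MyOptimizer)` all resolve to the same class, mirroring the three accepted argument kinds listed for `class_name`.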
- class returnn.tf.updater.Updater(config, network, initial_learning_rate=1.0)
This will create the tf.compat.v1.train.Optimizer instance given the config and the update-op for all trainable vars. See the code of Updater.create_optimizer() for valid config options.
Wraps one or multiple tf.compat.v1.train.Optimizer, and extends it by some further functionality.
Note: Vincent Vanhoucke says, in case you get NaNs, consider increasing the epsilon (for Adam, Nadam and similar). This is the config option optimizer_epsilon. In some places in our Theano code, 1e-16 is our default epsilon; in other parts, it is 1e-8. 1e-8 might be more stable, or even 1e-6. Note that when the gradient is suddenly zero in one step, the update can be proportional to lr / eps.
From the tf.compat.v1.train.AdamOptimizer documentation:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.
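To make the role of epsilon concrete, here is a small plain-Python sketch of the very first bias-corrected Adam step for a single scalar gradient. The function name `first_adam_step` and the sample values are illustrative assumptions. The point is only that for a tiny gradient, a tiny epsilon still yields a step of roughly the full learning rate, while a larger epsilon damps it strongly.

```python
import math


def first_adam_step(g, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1) for a scalar gradient g."""
    m = (1.0 - beta1) * g      # first moment after one step
    v = (1.0 - beta2) * g * g  # second moment after one step
    lr_t = lr * math.sqrt(1.0 - beta2) / (1.0 - beta1)  # bias-corrected rate
    return lr_t * m / (math.sqrt(v) + eps)


tiny_grad = 1e-10  # illustrative value
step_small_eps = first_adam_step(tiny_grad, eps=1e-16)
step_large_eps = first_adam_step(tiny_grad, eps=1e-6)
print(step_small_eps, step_large_eps)
```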
More from Vincent Vanhoucke:
One thing you can do is run with a tiny learning rate, or even zero learning rate. If you still have divergence then, you have a bug in your setup. If not, increase your rate slowly and see if there is a regime in which things train without diverging. It’s completely possible to have weights that are in a good range, but activations or gradients going to infinity because of the shape of the loss, or too high a learning rate. It’s obviously always a possibility that there is a bug in the optimizers, but in my experience, every single instance of this kind of problem could be traced back to a weirdly wired model, learning rate issues, bad randomization of the input examples, or - in the case of Adam or RMSProp - issues with the epsilon value.
In addition, you might also want to try gradient_nan_inf_filter or maybe set beta1=0.5.
For further debugging, see tf.add_check_numerics_ops() or add_check_numerics_ops_and_debug_print(), which is config option debug_add_check_numerics_ops. Also relevant are config options debug_add_check_numerics_on_output and debug_grad_summaries.
- Parameters:
config (returnn.config.Config) –
network (TFNetwork) –
initial_learning_rate (float) –
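RETURNN configs are commonly written as Python files, so the options named above can be set as plain assignments. The excerpt below is a hedged sketch: only the option names come from the text above, while the value types (a float and boolean flags) and the concrete values are assumptions.

```python
# Hypothetical excerpt from a RETURNN config file; values are illustrative.
optimizer_epsilon = 1e-8             # a larger epsilon may help against NaNs
gradient_nan_inf_filter = True       # assumed boolean flag
debug_add_check_numerics_ops = True  # assumed boolean flag
debug_grad_summaries = True          # assumed boolean flag
```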
- reset_optim_op()
Call this if something changes that the optim_op depends on. See self.create_optim_op().
- set_learning_rate(value, session)
- Parameters:
value (float) –
session (tf.compat.v1.Session) –
- create_optim_op()
Creates the optimize TF op.
- Returns:
nothing, will just set self.optim_op
- get_optim_op(callback_on_new=None)
- Parameters:
callback_on_new (None|()->None) –
- Return type:
tf.Operation
- get_default_optimizer_item(auto_create_new)
- Parameters:
auto_create_new (bool) –
- Returns:
key, optimizer
- Return type:
(object, tf.compat.v1.train.Optimizer)
- get_slot_names_per_optimizer()
- Returns:
ordered dict: opt key -> slot names
- Return type:
dict[object, list[str]]
- filter_var_list_per_optimizer_key(var_list, opt_key)
- Parameters:
var_list (list[tf.Variable]) –
opt_key (object) – should be in self.optimizer
- Return type:
list[tf.Variable]
- returnn.tf.updater.accum_grad_multiple_step(grad, var, train_step, num_accum_steps)
- Parameters:
grad (tf.Tensor|tf.IndexedSlices) –
var (tf.Variable) –
train_step (tf.Tensor) – int, scalar
num_accum_steps (int) –
- Returns:
modified grad
- Return type:
tf.Tensor
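The signature suggests gradient accumulation over num_accum_steps training steps. As a rough illustration of the general idea, the plain-Python sketch below keeps a running sum that is reset at the start of every window. The class name `GradAccumulator` and the exact reset rule are assumptions; the RETURNN function operates on TF tensors and may differ in detail.

```python
class GradAccumulator:
    """Toy running-sum accumulator (hypothetical, for illustration only)."""

    def __init__(self, num_accum_steps):
        self.num_accum_steps = num_accum_steps
        self.buffer = 0.0

    def step(self, grad, train_step):
        # Assumed rule: restart the sum on the first step of each window,
        # otherwise add the new gradient to the running sum.
        if train_step % self.num_accum_steps == 0:
            self.buffer = grad
        else:
            self.buffer += grad
        return self.buffer


acc = GradAccumulator(num_accum_steps=2)
sums = [acc.step(g, t) for t, g in enumerate([1.0, 2.0, 3.0, 4.0])]
```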
- class returnn.tf.updater.BaseCustomOptimizer(learning_rate, use_locking=False, name=None)
Base class for our own optimizer implementations. This simplifies the interface to be implemented a bit from Optimizer. You just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer for an example.
Construct a new optimizer.
- Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
- class returnn.tf.updater.CustomGradientDescentOptimizer(learning_rate, use_locking=False, name=None)
Just an example implementation for simple gradient descent.
Construct a new optimizer.
- Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
- class returnn.tf.updater.NormalizedSGD(learning_rate, use_locking=False, name=None)
All grads are L2 normalized (via tf.nn.l2_normalize()), otherwise it’s standard SGD. Via: https://github.com/kmkolasinski/deep-learning-notes/tree/master/max-normed-optimizer
Construct a new optimizer.
- Args:
learning_rate: A Tensor or a floating point value. The learning rate to use.
use_locking: If True, use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
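A minimal plain-Python sketch of the idea: scale the gradient to unit L2 norm, then take an ordinary SGD step. The helper names and the small `eps` guard against a zero-norm gradient are assumptions. Whether normalization is applied per variable, as done here, is not stated above.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; eps guards against a zero vector."""
    norm = math.sqrt(max(sum(x * x for x in vec), eps))
    return [x / norm for x in vec]


def normalized_sgd_step(var, grad, learning_rate):
    """One SGD step using the L2-normalized gradient."""
    unit_grad = l2_normalize(grad)
    return [v - learning_rate * g for v, g in zip(var, unit_grad)]


new_var = normalized_sgd_step(var=[1.0, 1.0], grad=[3.0, 4.0], learning_rate=0.1)
```

Because the gradient is normalized, the length of each step depends only on the learning rate, not on the gradient magnitude.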
- class returnn.tf.updater.NeuralOptimizer1(beta1=0.9, decrease_factor=0.1, **kwargs)
Via Neural Optimizer Search with Reinforcement Learning (https://proceedings.mlr.press/v70/bello17a/bello17a.pdf).
Equivalent to the optimizer g * exp(sign(g) * sign(m)), we use:
g * where(sign(g) == sign(m), 1.0, decrease_factor)
where m is the running average of g.
Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g
Same beta1 default as in Adam and in the paper: beta1=0.9
- Parameters:
beta1 (float) – used for the running average of m
decrease_factor (float) – in the original paper, it is e^-2 ~= 0.135
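The rule can be sketched for a single scalar parameter as follows. In the paper's form, agreeing signs give a factor e and disagreeing signs a factor 1/e, so the ratio between the two cases is e^-2, which is where the value noted for decrease_factor comes from; the remaining constant factor can be absorbed into the learning rate. The function names and the choice to update m before comparing signs are assumptions.

```python
import math


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def neural_optimizer1_grad(g, m_prev, beta1=0.9, decrease_factor=0.1):
    """Return (scaled gradient, new running average m) for a scalar g."""
    m = beta1 * m_prev + (1.0 - beta1) * g  # running average of g
    # Keep g when its sign agrees with m, otherwise shrink it.
    factor = 1.0 if sign(g) == sign(m) else decrease_factor
    return g * factor, m


agree, _ = neural_optimizer1_grad(g=1.0, m_prev=1.0)
disagree, _ = neural_optimizer1_grad(g=-1.0, m_prev=1.0)
paper_factor = math.exp(-2)  # ratio exp(-1) / exp(1) from the paper's form
```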
- class returnn.tf.updater.GradVarianceScaledOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)
Let m be the running average of g.
Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g
Same beta1 default as in Adam and in the paper: beta1=0.9
Let v be the running average of the variance of g, i.e. of (g - m)^2.
- Parameters:
beta1 (float) – used for the running average of g (m)
beta2 (float) – used for the running average of variance of g (v)
epsilon (float) –
- class returnn.tf.updater.NadamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')
Optimizer that implements the Nadam algorithm. See [Dozat, T., 2015](http://cs229.stanford.edu/proj2015/054_report.pdf).
Copied from: https://github.com/tensorflow/tensorflow/blob/v1.15.5/tensorflow/contrib/opt/python/training/nadam_optimizer.py
We have this here to have this Nadam variant available in TF 2, because the Keras Nadam behaves a bit differently. See https://github.com/rwth-i6/returnn/issues/766 and https://github.com/tensorflow/tensorflow/issues/53204 for details.
We can still use this old code because the underlying kernel still supports the use_nesterov option.
Construct a new Adam optimizer.
Initialization:
$$m_0 := 0 \quad \text{(Initialize initial 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize initial 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:
$$t := t + 1$$
$$\text{lr}_t := \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - \text{lr}_t * m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
- Args:
learning_rate: A Tensor or a floating point value. The learning rate.
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True, use locks for update operations.
name: Optional name for the operations created when applying gradients. Defaults to “Adam”.
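The update formulas in this docstring are those of plain Adam. As a hedged sketch of what a Nesterov-style variant changes, the scalar example below replaces m_t in the final step with the look-ahead term beta1 * m_t + (1 - beta1) * g. That this matches the behaviour of the use_nesterov kernel option is an assumption, not something stated above; the function name `adam_like_step` is hypothetical.

```python
import math


def adam_like_step(g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, use_nesterov=False):
    """One scalar update; returns (delta to subtract, new m, new v)."""
    lr_t = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    if use_nesterov:
        # Assumed Nesterov variant: mix the updated m with the raw gradient.
        numerator = beta1 * m + (1.0 - beta1) * g
    else:
        numerator = m
    return lr_t * numerator / (math.sqrt(v) + eps), m, v


adam_delta, _, _ = adam_like_step(g=1.0, m=0.0, v=0.0, t=1)
nadam_delta, _, _ = adam_like_step(g=1.0, m=0.0, v=0.0, t=1, use_nesterov=True)
```

On the first step the Nesterov-style variant takes a larger step than plain Adam, since the raw gradient is weighted in a second time.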
- class returnn.tf.updater.CustomAdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)
Reimplementation of Adam. See also tf.compat.v1.train.AdamOptimizer.
```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```
- Parameters:
beta1 (float) – used for the running average of g (m)
beta2 (float) – used for the running average of g*g (v)
epsilon (float) –
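The update rule of this class translates almost line by line into plain Python. The sketch below applies it to a single scalar and minimizes the toy objective f(x) = x^2. The class name `ScalarAdam`, the objective, and the hyperparameter values are illustrative assumptions and unrelated to the TF implementation.

```python
import math


class ScalarAdam:
    """Adam for one scalar variable, following the documented update rule."""

    def __init__(self, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1, self.beta2, self.epsilon = beta1, beta2, epsilon
        self.t, self.m, self.v = 0, 0.0, 0.0

    def apply(self, variable, g):
        """Return the updated variable for gradient g."""
        self.t += 1
        lr_t = (self.learning_rate * math.sqrt(1.0 - self.beta2 ** self.t)
                / (1.0 - self.beta1 ** self.t))
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * g
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * g * g
        return variable - lr_t * self.m / (math.sqrt(self.v) + self.epsilon)


opt = ScalarAdam(learning_rate=0.1)
x = 1.0
for _ in range(200):
    x = opt.apply(x, g=2.0 * x)  # gradient of f(x) = x^2
```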
- class returnn.tf.updater.AMSGradOptimizer(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())
https://colab.research.google.com/notebook#fileId=1xXFAuHM2Ae-OmF5M8Cn9ypGCa_HHBgfG&scrollTo=N1-2wPHN1Otn
https://openreview.net/pdf?id=ryQu7f-RZ
https://keras.io/optimizers/
https://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage
https://github.com/taki0112/AMSGrad-Tensorflow
Create a new Optimizer.
This must be called by the constructors of subclasses.
- Args:
use_locking: Bool. If True, use locks to prevent concurrent updates to variables.
name: A non-empty string. The name to use for accumulators created for the optimizer.
- Raises:
ValueError: If name is malformed.