Updater(config, tf_session, network)¶
This will create the tf.train.Optimizer instance given the config, and the update op for all trainable vars. See the code of Updater.create_optimizer() for valid config options.
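For orientation, here is a plain-TF1 sketch of what "optimizer instance plus update op over all trainable vars" amounts to. This is not the actual RETURNN Updater code, and the value names are illustrative only:

    import tensorflow as tf

    # Toy graph with one trainable variable and a loss.
    x = tf.Variable(1.0)
    loss = tf.square(x - 3.0)

    # In the Updater, values like these come from the config
    # (see Updater.create_optimizer() for the real option names).
    learning_rate = 0.001
    optimizer_epsilon = 1e-8

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, epsilon=optimizer_epsilon)
    # The update op applies the gradient step to all trainable vars.
    optim_op = optimizer.minimize(loss, var_list=tf.trainable_variables())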
Note: Vincent Vanhoucke says, in case you get NaNs, consider increasing the epsilon (for Adam, Nadam and similar). This is the config option optimizer_epsilon. In some places in our Theano code, the default epsilon is 1e-16; in other parts it is 1e-8. 1e-8 might be more stable, or even 1e-6. Note that when the gradient suddenly becomes zero in one step, the update can be proportional to lr / eps.
tf.train.AdamOptimizer documentation: The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.
More from Vincent Vanhoucke: One thing you can do is run with a tiny learning rate, or even zero learning rate. If you still have divergence then, you have a bug in your setup. If not, increase your rate slowly and see if there is a regime in which things train without diverging. It’s completely possible to have weights that are in a good range, but activations or gradients going to infinity because of the shape of the loss, or too high a learning rate. It’s obviously always a possibility that there is a bug in the optimizers, but in my experience, every single instance of this kind of problem could be traced back to a weirdly wired model, learning rate issues, bad randomization of the input examples, or - in the case of Adam or RMSProp - issues with the epsilon value.
In addition, you might also want to try gradient_nan_inf_filter, or maybe set beta1=0.5.
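A hedged sketch of how such settings might look in a RETURNN config (a Python file); optimizer_epsilon and gradient_nan_inf_filter are the options named above, while the remaining keys are assumptions about common config options and may differ per RETURNN version:

    # Excerpt of a RETURNN config (illustrative, not exhaustive).
    adam = True                     # assumption: select the Adam optimizer this way
    learning_rate = 0.0005          # try 0 or a tiny value to debug divergence, as suggested above
    optimizer_epsilon = 1e-6        # larger epsilon can help against NaNs
    gradient_nan_inf_filter = True  # filter out nan/inf gradients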
For further debugging, see add_check_numerics_ops_and_debug_print(), which can be enabled via a config option.

Parameters:	callback_on_new (None|()->None) –
Return type:	tf.Operation
Call this if something is changed which the optim_op depends on. See self.create_optim_op().
add_check_numerics_ops(fetches=None, ignore_ops=None, use_check_numerics=True, debug_print_added_checks=True, name='add_check_numerics_ops')¶
This is similar to tf.add_check_numerics_ops() and based on similar code. It adds some more logic and options.
- fetches (list[tf.Operation|tf.Tensor]|None) – in case this is given, will only look at these and dependent ops
- ignore_ops (list[str]) – e.g. “”
- use_check_numerics (bool) – if False, instead of
tf.check_numerics(), it does the check manually (via
tf.is_finite()) and in case there is inf/nan, it will also print the tensor (while tf.check_numerics does not print the tensor). Note that this can be about 50 times slower.
- debug_print_added_checks (bool) – prints info about each added check
- name (str) – op-name for the final tf.group
operation which performs all the checks
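A minimal usage sketch in TF1-style graph mode; the import path below is an assumption and may differ depending on the RETURNN version:

    import tensorflow as tf
    # Assumed import path; adjust to where this function lives in your RETURNN version.
    from returnn.tf.util.basic import add_check_numerics_ops

    x = tf.placeholder(tf.float32, shape=(None,))
    y = tf.log(x)  # yields -inf for x == 0, nan for x < 0
    # Only check ops which y depends on; use the manual (slower) check so that
    # the offending tensor is also printed when an inf/nan shows up.
    check_op = add_check_numerics_ops(fetches=[y], use_check_numerics=False)

    with tf.Session() as session:
      # This run will report the inf caused by log(0).
      session.run([y, check_op], feed_dict={x: [0.0, 1.0]})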
Parameters:	class_name (str) – e.g. "adam"
Returns:
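For illustration only, a hypothetical sketch of what such a class-name lookup does (the actual helper and the set of supported names may differ):

    import tensorflow as tf

    # Hypothetical mapping from config-style names to TF1 optimizer classes.
    _optimizer_classes = {
      "adam": tf.train.AdamOptimizer,
      "sgd": tf.train.GradientDescentOptimizer,
      "momentum": tf.train.MomentumOptimizer,
      "rmsprop": tf.train.RMSPropOptimizer,
    }

    def get_optimizer_class_sketch(class_name):
      """Resolve e.g. "adam" to the corresponding optimizer class."""
      return _optimizer_classes[class_name.lower()]

    assert get_optimizer_class_sketch("adam") is tf.train.AdamOptimizer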