Debugging neural networks is hard. The worst situation is when training and inference both work but the scores are just not good. It’s simpler if training does not even start because of some error. Here I will outline several useful options and some general methods.

Random config options which can always be enabled without losing performance:

  • log_verbosity = 5: Increases log verbosity. This will enable printing loss and other info per each individual batch.

  • log_batch_size = True: Logs individual batch sizes (requires log verbosity 5).

  • tf_log_memory_usage = True: TF: Log device (GPU) memory usage, per batch and also summary.

  • torch_log_memory_usage = True: PT: Log device (GPU) memory usage, per batch and also summary.

  • use_lovely_tensors = True: PT: Apply the lovely tensors monkey patch for torch.Tensor.__repr__

  • watch_memory = True: Enables background CPU memory usage (RSS, PSS, USS) logging per subprocess.

Random config options which will cause slower training or require more memory:

  • debug_add_check_numerics_ops = True: See below.

  • debug_add_check_numerics_on_output = True: See below.

Random other config options:

  • dry_run = True: Do not save models or write any files.

  • use_dummy_datasets = True: Use dummy datasets, which create random data on the fly.

  • startup_callback: Function callback which runs early at init. It can be used to set up other debugging utilities.

  • debug_shell_in_runner = True: TF: Get an interactive shell right before the session runs a step.
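As a rough sketch, a debugging session with the PyTorch backend might combine the options above like this in the config (a RETURNN config is a plain Python file). The option names are taken from the lists above; the particular combination is only an illustration, and the two smoke-test options at the end should be removed again for real training.

```python
# Debugging-related settings for a RETURNN config.
# Option names come from the lists above; this combination is an assumption.

# Cheap options, can stay enabled:
log_verbosity = 5  # per-batch loss and other info
log_batch_size = True  # requires log_verbosity = 5
watch_memory = True  # background CPU memory logging (RSS, PSS, USS) per subprocess

# PyTorch backend only:
torch_log_memory_usage = True  # device (GPU) memory usage per batch and as summary
use_lovely_tensors = True  # nicer torch.Tensor.__repr__

# Costs runtime and memory, enable only while hunting nan/inf:
debug_add_check_numerics_on_output = True

# Smoke test: random data generated on the fly, nothing written to disk.
use_dummy_datasets = True
dry_run = True
```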

Test case

In many cases, it can be helpful to write a test case to reproduce some error. This can then also become part of RETURNN such that this error will not happen anymore in the future. It also helps to better understand the problem and to debug it on your local computer, especially with an interactive graphical debugger like PyCharm.

If this is not an error, but just strange or wrong behavior, it can be helpful to set up some toy data and a small network which you can also run on your local computer. A fast turnaround time is important, as is easy access to all the information you need.

It might also be useful to inspect the gradients (e.g. the gradients w.r.t. activations) and check whether they look like what you expect.
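One framework-independent way to check whether gradients look as expected is to compare a hand-derived gradient against finite differences on a tiny example. The following is a minimal sketch in plain Python; the network, the values and all function names are made up for illustration and are not part of RETURNN.

```python
import math


def loss_from_hidden(hidden, out_weights, target):
    """Loss of a tiny network as a function of its hidden activations."""
    out = sum(v * math.tanh(h) for v, h in zip(out_weights, hidden))
    return 0.5 * (out - target) ** 2


def analytic_grad(hidden, out_weights, target):
    """Hand-derived gradient of the loss w.r.t. each hidden activation (chain rule)."""
    out = sum(v * math.tanh(h) for v, h in zip(out_weights, hidden))
    return [(out - target) * v * (1.0 - math.tanh(h) ** 2) for v, h in zip(out_weights, hidden)]


def numeric_grad(hidden, out_weights, target, eps=1e-6):
    """Central finite differences, perturbing one hidden activation at a time."""
    grads = []
    for i in range(len(hidden)):
        plus = list(hidden)
        plus[i] += eps
        minus = list(hidden)
        minus[i] -= eps
        diff = loss_from_hidden(plus, out_weights, target) - loss_from_hidden(minus, out_weights, target)
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical toy values.
hidden = [0.3, -1.2, 2.0]
out_weights = [0.5, -0.7, 1.1]
target = 0.25

expected = analytic_grad(hidden, out_weights, target)
measured = numeric_grad(hidden, out_weights, target)
max_abs_diff = max(abs(e - m) for e, m in zip(expected, measured))
print("analytic:", expected)
print("numeric: ", measured)
print("max abs difference:", max_abs_diff)
```

A large difference would point to a mistake in the derivation or in the implementation of the backward pass.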

TensorFlow tools

When using the TensorFlow backend in RETURNN: You can use all the native TensorFlow debugging tools. Please refer to the TF documentation or other resources.

Some functionality:

For example, to use tfdbg, try this in the config:

def wrap_tf_session(session):
    """Wrap a TensorFlow session to add debugging capabilities."""
    from tensorflow.python import debug as tf_debug

    return tf_debug.LocalCLIDebugWrapperSession(session)

Interactive debugging

Often just looking at the stack trace already makes it clear what the problem is, especially with the additional information added via better_exchook. However, sometimes it is more helpful to debug interactively, i.e. to use an interactive shell.

There is the option debug_shell_in_runner to get a Python shell directly in the main loop (over the steps / mini batches), with all local variables available, and the feed_dict already prepared, such that you can interactively run on tensors, etc.

You can run RETURNN via:

ipython --pdb

That will give you the IPython debugger shell once you hit an unhandled exception. You can summon the interactive shell by explicitly calling the following from the source code or from the config:

from returnn.util.debug import debug_shell
debug_shell(user_ns=locals(), user_global_ns=globals(), exit_afterwards=False)

Setting up a graphical debugger (e.g. PyCharm) and using breakpoints can be even more helpful. For that, it is useful to have set up a small toy example which you can easily start on your local computer.

Also see Remote Development and Debugging.

Shapes and Tensor

(TensorFlow specific.)

There is debug_print_layer_output_template which can always be enabled, as it only prints additional information about the shape and Tensor for every layer at startup time, so it does not add any cost at runtime. This is very helpful, as you can go through that information to double check whether the output shape/type of each layer is as expected. Most errors can be localized this way.

There is also debug_print_layer_output_shape which is only useful for debugging, as it will print the output shape at runtime for every single step.

Runtime performance

For example, the model trains slower than expected, or you just want to see whether it can be made faster. See Profiling (Python generic).
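As a generic starting point, the Python standard library profiler can show where the time goes. This is a minimal sketch using cProfile and pstats; slow_step and train are hypothetical stand-ins for real training code, and this does not show how RETURNN itself is profiled.

```python
import cProfile
import io
import pstats


def slow_step(n=20000):
    """Stand-in for one training step (hypothetical workload)."""
    return sum(i * i for i in range(n))


def train(num_steps=20):
    """Stand-in for the training loop."""
    return [slow_step() for _ in range(num_steps)]


profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Note that cProfile only sees Python-level calls. Time spent inside native GPU kernels is attributed to the Python function which launched them.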

Memory leaks

A memory leak means that some memory is allocated and never freed. As we execute the code over and over again, we allocate more and more memory. Sometimes the leak is small, so that it is not noticeable for a while, and only after some hours or even days of training does it crash with an out-of-memory (OOM) error (GPU/CUDA OOM or CPU OOM).

It can be tricky to find the source of the leak. Many profiler tools also try to report memory usage. See Profiling (Python generic).

Getting nan/inf

There are various possible sources. In general, you get these for calculations like x/0.0, log(0.0), …

Use debug_add_check_numerics_on_output to enable runtime checks after every layer. That will help you localize where it occurs. This adds slightly to the memory requirements and also makes it slightly slower, but it is still reasonably fast.

debug_add_check_numerics_ops does the same, but for every single tensor. This is usually too expensive.
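The idea behind checking after every layer can be illustrated independently of any framework. The following plain Python sketch is not how RETURNN implements these options; the layers and all names are invented. Without checks, only the final output is visibly broken. With checks, the first layer producing a non-finite value is reported.

```python
import math


def check_finite(name, values):
    """Raise as soon as a layer output contains nan or inf."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            raise FloatingPointError(f"layer {name!r}: non-finite value {v} at index {i}")


def scale(xs):
    """Toy layer: multiplies by a huge factor, overflows to inf when applied twice."""
    return [x * 1e200 for x in xs]


def center(xs):
    """Toy layer: subtracts the mean, turns inf into nan (inf - inf)."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


layers = [("scale_1", scale), ("scale_2", scale), ("center", center)]


def forward(xs, check_each_layer):
    """Run all layers in order, optionally checking every layer output."""
    for name, fn in layers:
        xs = fn(xs)
        if check_each_layer:
            check_finite(name, xs)
    return xs


# Without checks: the result is nan, with no hint where it came from.
unchecked = forward([2.0, 3.0], check_each_layer=False)
print(unchecked)

# With checks: the error names the first offending layer.
first_bad_layer_message = None
try:
    forward([2.0, 3.0], check_each_layer=True)
except FloatingPointError as exc:
    first_bad_layer_message = str(exc)
print(first_bad_layer_message)
```

Here the nan appears in the last layer, but the check shows that the actual problem (the inf) already starts one layer earlier.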

Options like debug_grad_summaries or debug_save_updater_vars can also be helpful to localize e.g. a variable which explodes during training. See monitoring.


By default, RETURNN will dump all the losses and error information to a TensorFlow event file. This can be watched live (but also afterwards) via TensorBoard. The default directory of this log dir is the same as the model dir, but you can also configure it via tf_log_dir.

You would go into this log dir, and then:

tensorboard --logdir .

Bad scores

There is no crash, no nan/inf, but you just get bad scores. This is the hardest case to debug. Maybe you have a bug somewhere but you don’t know.

If you are reproducing some existing research, and there is another existing implementation of it, this is a very good starting point. You can try to reproduce the exact same model in RETURNN, and write a model importer script which imports a trained model from the other implementation into your RETURNN model. Now you can write a script which feeds exactly the same input to both and compares the hidden activations of each layer (or does some binary search over the layers). That is a systematic way to verify that you have exactly the same model. You can find a few such example scripts under tools/import-*.
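The comparison step can be sketched as follows, assuming the activations of both implementations have already been dumped as flat lists of floats per layer. The function first_mismatch, the layer names and the numbers are hypothetical and only illustrate the idea; this is not taken from the scripts under tools/import-*.

```python
import math


def first_mismatch(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return (layer, index) of the first differing activation, or None if all match.

    Both arguments map layer name -> flat list of floats, in forward order.
    The index is None if the two outputs differ in size.
    """
    for layer, ref_values in reference.items():
        cand_values = candidate[layer]
        if len(ref_values) != len(cand_values):
            return layer, None
        for i, (r, c) in enumerate(zip(ref_values, cand_values)):
            if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
                return layer, i
    return None


# Made-up activations from the existing implementation and from the port.
reference = {
    "embed": [0.10, -0.20, 0.30],
    "encoder": [1.50, 0.25, -0.75],
    "output": [0.90, 0.10],
}
candidate = {
    "embed": [0.10, -0.20, 0.30],
    "encoder": [1.50, 0.31, -0.75],
    "output": [0.85, 0.15],
}

print(first_mismatch(reference, candidate))
```

The first layer which differs is the place to look, since all later differences may just be a consequence of it.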

If you are playing with a new type of model, it helps to first try it on some toy dataset, where you know that it must work in principle. If it does not, you can design the toy samples in a way that helps you understand where it fails. In the extreme case, in theory, you should even be able to set the neural network weights by hand to solve the toy task. If you don’t know how, then maybe your model is actually not powerful enough. If that works, you can make the toy task successively harder and more similar to the real task. If all the toy tasks work, but the real task still does not, maybe you need some sort of curriculum learning or pretraining.
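As a small illustration of setting the weights by hand, consider XOR as the toy task and a network with two hidden threshold units. The weights below are chosen manually, not learned; the architecture and the names are only an example.

```python
def step(x):
    """Threshold activation."""
    return 1.0 if x > 0 else 0.0


def xor_net(a, b):
    """Two hidden units (OR and AND) and one output unit, weights set by hand."""
    h_or = step(1.0 * a + 1.0 * b - 0.5)
    h_and = step(1.0 * a + 1.0 * b - 1.5)
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


# The complete toy dataset: ((input_a, input_b), target).
toy_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

predictions = [xor_net(a, b) for (a, b), _ in toy_data]
targets = [target for _, target in toy_data]
print(predictions, targets)
```

Since a solution exists for this architecture, a training run which does not solve the task points to a problem in the optimization or the implementation rather than in the model capacity.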

Think about ways to visualize some of the internals of your model. E.g. for attention models, it helps to visualize the attention weights. In many other cases, this can be hard, though.

Measure things. Whatever you think is in some way useful, or gives you a hint whether it is doing the correct thing or not.

Also see Analysing neural networks.

Python exception

RETURNN uses better_exchook, which automatically provides an extended Python stack trace. Normally this gives enough information to understand the problem and to fix it. If not, debugging interactively can help: See Interactive debugging.

If there is a bug in RETURNN itself (or might be): In principle, a good way to work on a fix in a systematic way is to create a simple test case which reproduces the problem. Simplify further as much as possible to identify and understand the real problem. Then fix it. Commit both the test case and the fix (pull request).


Native crash, e.g. segmentation fault (segfault, SIGSEGV).

RETURNN uses the faulthandler Python module to provide a stack trace of the Python calls and also installs a native signal handler to provide a native stack trace (including e.g. native PyTorch or TensorFlow code).
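To get a feeling for what the faulthandler module provides, it can also be used directly. This minimal sketch writes the dump to a temporary file; the exact way RETURNN sets it up may differ.

```python
import faulthandler
import tempfile

with tempfile.TemporaryFile(mode="w+") as log_file:
    # From now on, a fatal signal such as SIGSEGV dumps the Python stack
    # of all threads to log_file before the process dies.
    faulthandler.enable(file=log_file, all_threads=True)
    enabled = faulthandler.is_enabled()

    # The same kind of dump can be requested explicitly at any time.
    faulthandler.dump_traceback(file=log_file, all_threads=True)

    # Disable before the file is closed, as the handler keeps using it.
    faulthandler.disable()
    log_file.seek(0)
    dumped = log_file.read()

print(dumped)
```

Such a dump only contains the Python frames. The native stack trace mentioned above comes from the separate native signal handler.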

Other resources