This is a summary and overview of all relevant training aspects. See also Training.


Losses

Training a neural network is usually done by gradient descent on a differentiable loss function (differentiable w.r.t. the model parameters), e.g. cross entropy, mean squared error or CTC.

You can define one or multiple losses in your network. (See Network Structure and Loss Functions about how to define them.)

You can define any custom calculation as a loss. A loss could be used for supervised training (i.e. you have some ground truth) or unsupervised training, or anything else (just some auxiliary loss, or just for regularization, or not used for the training itself but just for evaluation).

A loss is only used for training or evaluation, and usually not during recognition.
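As a rough illustration of defining multiple losses, the sketch below attaches one loss to the output layer and an auxiliary, down-weighted loss to a second layer. It assumes the dict-style network definition with `loss` and `loss_scale` options on a layer; the layer names and dimensions are hypothetical, and the exact option names should be checked against Loss Functions.

```python
# Sketch of a network definition with two losses (layer names are hypothetical).
network = {
    "encoder": {"class": "linear", "activation": "tanh", "n_out": 256, "from": "data"},
    # Auxiliary loss on an extra softmax layer, down-weighted via loss_scale.
    "aux_output": {"class": "softmax", "loss": "ce", "loss_scale": 0.3, "from": "encoder"},
    # Main loss.
    "output": {"class": "softmax", "loss": "ce", "from": "encoder"},
}

# Collect all layers which contribute a loss.
layers_with_loss = sorted(name for name, opts in network.items() if "loss" in opts)
```

Here `layers_with_loss` lists the two softmax layers, while the encoder contributes no loss of its own.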


Regularization

  • Extra loss terms

  • Model variations

    • Dropout

    • Variational param noise

    • Stochastic depth

    • Data augmentation (e.g. SpecAugment) ≡ extra layers on input

  • Directly in the network definition

  • Used in training only, or flexible / configurable

See also Regularization Layers.
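A minimal sketch of regularization set directly in the network definition follows. It assumes per-layer options named `dropout` and `L2`; the layer names and values are hypothetical, and Regularization Layers remains the reference for what each layer actually supports.

```python
# Sketch: regularization configured per layer (names and values are hypothetical).
network = {
    "hidden": {
        "class": "linear", "activation": "relu", "n_out": 512, "from": "data",
        "dropout": 0.1,  # model variation (assumed option name)
        "L2": 1e-4,      # extra loss term on the layer parameters (assumed option name)
    },
    "output": {"class": "softmax", "loss": "ce", "from": "hidden"},
}

# Layers which carry any regularization option.
regularized = [name for name, opts in network.items() if "dropout" in opts or "L2" in opts]
```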


Optimizer

The default optimizer is stochastic gradient descent (SGD), but Adam is also very common.

See Optimizer and Optimizer Settings for more. E.g. optimizer = "adam" in your config.

The learning rate and its schedule are also very relevant (see Learning rate scheduling).

Parameter initialization

Most layers let you configure this via options like forward_weights_init. The option value is resolved by a generic initializer function.

You can also preload some existing weights, e.g. via preload_from_files. See Model Loading.

Learning rate scheduling

Common features:

  • Learning rate warmup (start small, increase, either linearly or exponentially)

  • Constant phase

  • Decay

    • Exponentially or inverse square root, or other variation

    • With constant rate, or depending on cross-validation score (sometimes called “Newbob”)

  • Reset learning rate on certain epochs, or increase again

Set learning_rate_control in your config to select the scheduling method. To predefine learning rates for certain epochs (e.g. for resets or warmup), set learning_rates in your config. See Optimizer Settings.
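The phases listed above (warmup, constant phase, decay) can be sketched as a plain Python function that produces one learning rate per epoch. This is only an illustration: the function name `make_learning_rates` and all of its parameters are hypothetical, and it assumes that learning_rates accepts a list with one value per epoch.

```python
from typing import List


def make_learning_rates(
    peak_lr: float,
    warmup_epochs: int,
    constant_epochs: int,
    decay_epochs: int,
    decay_factor: float = 0.9,
    start_fraction: float = 0.1,
) -> List[float]:
    """Per-epoch learning rates: linear warmup, constant phase, exponential decay."""
    rates = []
    start_lr = peak_lr * start_fraction
    for i in range(warmup_epochs):
        # Linear interpolation from start_lr towards peak_lr.
        # The peak itself is first reached in the constant phase.
        rates.append(start_lr + (peak_lr - start_lr) * i / warmup_epochs)
    rates += [peak_lr] * constant_epochs
    for i in range(1, decay_epochs + 1):
        # Multiply by decay_factor once per epoch.
        rates.append(peak_lr * decay_factor ** i)
    return rates


learning_rates = make_learning_rates(
    peak_lr=1e-3, warmup_epochs=5, constant_epochs=10, decay_epochs=20)
```

A score-dependent decay (the "Newbob" variant) cannot be precomputed like this, because it depends on the cross-validation score observed during training.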

Generic Scheduling

Not only the learning rate can be scheduled, but many other aspects as well, such as:

  • Regularization (e.g. disable dropout initially, or have lower values)

  • Curriculum learning (i.e. take only an “easy” subset of training data initially, e.g. only the short sequences)

  • Apply gradient clipping only at the beginning of the training (see example below)

This can be done by overwriting config parameters using the pretraining logic (see Pretraining) or get_network() (see Custom training pipeline). In either case, parameters set under net_dict["#config"] will be used to overwrite existing config parameters. Example:

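gradient_clip = 0  # global default: no gradient clipping

def get_network(epoch: int, **kwargs):
    net_dict = ...  # your network definition (a dict)
    if epoch < 5:
        # Overwrite the config parameter for the first epochs only.
        # setdefault avoids a KeyError if "#config" is not in net_dict yet.
        net_dict.setdefault("#config", {})["gradient_clip"] = 10
    return net_dict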

Batching and dataset shuffling

  • How to build up individual mini-batches (their size, and the logic for that)

    • Batch size (batch_size, max_seqs, max_seq_length)

    • Chunking (chunking)

  • How to shuffle the dataset (the sequences), or how to iterate through it

    • E.g. shuffle seqs, and sort buckets ("laplace") to reduce padding
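To make the interplay of shuffling, bucket sorting and batch limits more concrete, here is a simplified sketch. It is not the actual implementation: `make_batches` and `bucket_size` are hypothetical names, batch_size is interpreted here as the padded number of frames per batch (an assumption), and the real "laplace" ordering differs in detail.

```python
import random
from typing import List


def make_batches(seq_lens: List[int], batch_size: int, max_seqs: int,
                 bucket_size: int, seed: int = 1) -> List[List[int]]:
    """Returns mini-batches as lists of sequence indices."""
    indices = list(range(len(seq_lens)))
    random.Random(seed).shuffle(indices)
    # Sort within buckets of shuffled seqs, so that neighbouring seqs
    # have similar lengths and need less padding.
    ordered = []
    for start in range(0, len(indices), bucket_size):
        bucket = indices[start:start + bucket_size]
        ordered += sorted(bucket, key=lambda i: seq_lens[i])
    batches, current, current_max = [], [], 0
    for i in ordered:
        new_max = max(current_max, seq_lens[i])
        # Padded size = number of seqs * length of the longest seq in the batch.
        too_big = new_max * (len(current) + 1) > batch_size
        if current and (too_big or len(current) >= max_seqs):
            batches.append(current)
            current, new_max = [], seq_lens[i]
        current.append(i)
        current_max = new_max
    if current:
        batches.append(current)
    return batches


seq_lens = [5, 50, 7, 48, 6, 52, 8, 47]
batches = make_batches(seq_lens, batch_size=100, max_seqs=3, bucket_size=4)
```

A larger bucket reduces padding further but makes the order less random; sorting the whole dataset would be the extreme case.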

See Training.

See Dataset Input/Output about how the dataset is loaded, and how you can implement your own custom dataset.


Pretraining

Pretraining can be understood as a phase before the main training, just to get the model parameters to a good starting point (beyond what parameter initialization alone provides).

  • Maybe a different loss during pretraining (e.g. unsupervised or custom)

  • Maybe train only a subset of the parameters

  • Different network topology every epoch, e.g. start with one layer, add more and more

  • Automatically copies over parameters from one epoch to the next as far as possible

    • Configurable

    • New weights are newly initialized (e.g. randomly, see Parameter initialization)

    • If dimension increased, can copy over existing weights (grow in width / dim.)
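The idea of a topology that grows over the epochs can be sketched with get_network(). This is a hedged illustration only: the growth rule (one more hidden layer every two epochs, up to four), the layer names and the dimensions are all hypothetical, and epochs are assumed to start at 1.

```python
def get_network(epoch: int, **kwargs) -> dict:
    """Hypothetical layer-wise growth: start with one hidden layer, add more later."""
    num_layers = min(1 + (epoch - 1) // 2, 4)
    net_dict = {}
    src = "data"
    for i in range(1, num_layers + 1):
        name = "hidden%i" % i
        net_dict[name] = {"class": "linear", "activation": "relu", "n_out": 512, "from": src}
        src = name
    # The output layer always reads from the last hidden layer.
    net_dict["output"] = {"class": "softmax", "loss": "ce", "from": src}
    return net_dict
```

Layers keep their names from one epoch to the next here, which is the kind of situation where existing parameters can plausibly be carried over while newly added layers are initialized from scratch.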

See also Pre-Training / Dynamic Networks or Pretraining.

Pretraining can be generalized to any custom training pipeline. See Custom training pipeline.

Custom training pipeline

This can be seen as a generalization of pretraining (see Pretraining). One example pipeline:


  1. Train small NN using frame-wise cross-entropy with linear alignment

  2. Calculate new alignment

  3. Train NN using frame-wise cross-entropy with new alignment

  4. Repeat with calculating new alignment (maybe increase NN size)


Another example pipeline:

  1. Train CTC model with CTC loss

  2. Calculate new alignment

  3. Train NN (e.g. transducer) using frame-wise cross-entropy with new alignment

You define def get_network(epoch: int, **kwargs): ... in your config.
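A staged pipeline of this kind might be sketched as follows, switching the loss depending on the epoch. Everything specific here is an assumption for illustration: the stage boundary `ctc_epochs`, the layer names, the data key "alignment", and the loss names "ctc" and "ce". The realignment step itself happens outside of this function.

```python
def get_network(epoch: int, **kwargs) -> dict:
    """Hypothetical two-stage pipeline: CTC first, then frame-wise cross-entropy."""
    ctc_epochs = 10  # hypothetical length of the first stage
    encoder = {"class": "linear", "activation": "relu", "n_out": 512, "from": "data"}
    if epoch <= ctc_epochs:
        # Stage 1: train with the CTC loss.
        output = {"class": "softmax", "loss": "ctc", "from": "encoder"}
    else:
        # Stage 2: frame-wise cross-entropy on a newly calculated alignment,
        # assumed to be provided by the dataset under the key "alignment".
        output = {"class": "softmax", "loss": "ce", "target": "alignment", "from": "encoder"}
    return {"encoder": encoder, "output": output}
```

Because the encoder definition is identical in both stages, its parameters can be reused when the second stage starts.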

Multi-GPU training

See Multi-GPU training.

Deterministic training

See Deterministic training.