This is a summary and overview of all relevant training aspects. See also Training.
Losses#
Training a neural network is usually done by gradient descent on a loss function that is differentiable w.r.t. the model parameters, e.g. cross entropy, mean squared error or CTC.
You can define any custom calculation as a loss. A loss can be used for supervised training (i.e. you have some ground truth), for unsupervised training, or for anything else: as an auxiliary loss, just for regularization, or not for the training itself but only for evaluation.
A loss is only used for training or evaluation, but usually not in recognition.
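For instance, a loss can be attached directly to a layer in the network definition. A minimal sketch (the layer names and dimensions here are arbitrary):

```python
network = {
    "hidden": {"class": "linear", "activation": "tanh", "n_out": 500, "from": "data"},
    # Softmax output layer, trained with cross entropy against the "classes" target.
    "output": {"class": "softmax", "loss": "ce", "target": "classes", "from": "hidden"},
}
```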
Regularization#
- Extra loss terms
  - L2 (can be added to the layer options; see the sketch below)
  - Other auxiliary losses (supervised or unsupervised)
- Variational param noise
- Data augmentation (e.g. SpecAugment), which effectively means extra layers on the input
  - Directly in the network definition
  - Used in training only, or flexible / configurable
See also Regularization Layers.
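For example, L2 can be enabled per layer via its options. A sketch (the concrete values are arbitrary):

```python
network = {
    "hidden": {
        "class": "linear", "activation": "relu", "n_out": 500, "from": "data",
        "L2": 1e-4,      # adds an L2 penalty on this layer's parameters to the total loss
        "dropout": 0.1,  # dropout on the layer input, applied in training only
    },
    "output": {"class": "softmax", "loss": "ce", "target": "classes", "from": "hidden"},
}
```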
Optimizer#
The default is plain stochastic gradient descent (SGD), but Adam is also very common.
The learning rate is also very relevant, as is its scheduling (see Learning rate scheduling).
You can also preload some existing weights into parts of the model. See Model Loading.
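A sketch of the corresponding config settings (the values and the checkpoint path are placeholders; see the linked documentation for all options):

```python
optimizer = {"class": "adam", "epsilon": 1e-8}  # instead of the default plain SGD
learning_rate = 0.001

# Preload existing weights for a part of the network from another checkpoint.
preload_from_files = {
    "encoder": {
        "filename": "/path/to/other-model/epoch.042",  # placeholder path
        "prefix": "encoder_",  # only for params whose name starts with this prefix
    },
}
```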
Learning rate scheduling#
- Learning rate warmup (start small, then increase, either linearly or exponentially)
- Learning rate decay: exponential, inverse square root, or some other variation
  - With a constant rate, or depending on the cross-validation score (sometimes called "Newbob")
- Reset the learning rate at certain epochs, or increase it again
This is configured via `learning_rate_control` in your config. You can predefine certain learning rates (`learning_rates` in the config) for resets or warmup, as in the sketch below.
See Optimizer Settings.
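A sketch of typical settings, assuming the Newbob-style control (the concrete values are arbitrary):

```python
learning_rate = 0.001
learning_rate_control = "newbob_multi_epoch"  # decay based on the cross-validation score
newbob_multi_num_epochs = 6
newbob_learning_rate_decay = 0.7

# Predefined learning rates for the first epochs, e.g. for a linear warmup;
# afterwards, the control above takes over.
learning_rates = [0.0002, 0.0004, 0.0006, 0.0008, 0.001]
```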
Not only the learning rate can be scheduled, but many other aspects as well, such as:
- Regularization (e.g. disable dropout initially, or start with lower values)
- Curriculum learning (i.e. take only an "easy" subset of the training data initially, e.g. only the short sequences)
- Apply gradient clipping only at the beginning of the training (see the example below)
This can be done by overwriting config parameters, either via the pretraining logic or via `get_network()` (see Custom training pipeline). In either case, parameters set under `net_dict["#config"]` will be used to overwrite existing config parameters.

```python
# Disable gradient clipping globally by default...
gradient_clip = 0

def get_network(epoch: int, **kwargs):
    net_dict = ...  # construct your network definition here
    if epoch < 5:
        # ...but enable it for the first few epochs.
        net_dict.setdefault("#config", {})["gradient_clip"] = 10
    return net_dict
```
Batching and dataset shuffling#
- How to build up the individual mini-batches (their size, and the logic for that)
  - Batch size (`batch_size` in the config)
- How to shuffle the dataset (the sequences), or how to iterate through it
  - E.g. shuffle the sequences, and sort within buckets (`"laplace"`) to reduce padding (see the sketch below)
See Dataset Input/Output about how the dataset is loaded, and how you can implement your own custom dataset.
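A sketch of typical settings (the dataset here is a placeholder; the exact options depend on the dataset class):

```python
batch_size = 10000  # max total number of frames (timesteps) in one mini-batch
max_seqs = 200      # max number of sequences in one mini-batch

train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    # Shuffle the sequences, then sort within buckets by length to reduce padding.
    "seq_ordering": "laplace:.1000",
}
```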
Pretraining#
Pretraining can be understood as a phase before the main training, just to get the model parameters to a good starting point (beyond plain parameter initialization). Possible variations:
- Maybe a different loss during pretraining (e.g. unsupervised, or custom)
- Maybe train only a subset of the parameters
- A different network topology in every (pretrain) epoch, e.g. start with one layer and add more and more (see the sketch after this list)
  - Parameters are automatically copied over from one epoch to the next as far as possible
  - New weights are newly initialized (e.g. randomly, see Parameter initialization)
  - If a dimension increased, existing weights can be copied over (grow in width / dim.)
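A minimal sketch of such a growing setup via the `pretrain` config option. The `construction_algo` callback gets the pretrain step index and the full network definition, and returns the reduced network for that step, or None once pretraining is finished. The layer names (`"lstm1"` etc.) and the total depth here are assumptions:

```python
_num_layers = 6  # depth of the full network (assumed layer names "lstm1".."lstm6")

def custom_construction_algo(idx, net_dict):
    num_layers = idx + 1  # step 0 keeps 1 layer, each further step adds one
    if num_layers >= _num_layers:
        return None  # full depth reached: end pretraining, use the net as-is
    net_dict = dict(net_dict)  # shallow copy, do not modify the original
    for i in range(num_layers + 1, _num_layers + 1):
        del net_dict["lstm%i" % i]  # drop the upper layers for this step
    net_dict["output"] = dict(net_dict["output"])
    net_dict["output"]["from"] = ["lstm%i" % num_layers]
    return net_dict

pretrain = {"construction_algo": custom_construction_algo, "repetitions": 2}
```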
Pretraining can be generalized to any custom training pipeline. See Custom training pipeline.
Custom training pipeline#
This can be seen as a generalization of pretraining (see Pretraining).
One example pipeline:
1. Train a small NN using frame-wise cross-entropy with a linear alignment
2. Calculate a new alignment
3. Train the NN using frame-wise cross-entropy with the new alignment
4. Repeat from step 2, calculating new alignments (and maybe increasing the NN size)
Another example:
1. Train a CTC model with CTC loss
2. Calculate a new alignment
3. Train another NN (e.g. a transducer) using frame-wise cross-entropy with the new alignment
Such a pipeline can be implemented via `def get_network(epoch: int, **kwargs): ...` in your config.
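A minimal sketch of the second example pipeline above. The helpers `make_ctc_net()` and `make_transducer_net()` are hypothetical and stand for your own network construction; the epoch boundary is arbitrary:

```python
def make_ctc_net() -> dict:
    ...  # hypothetical: build the network dict with a CTC loss

def make_transducer_net() -> dict:
    ...  # hypothetical: build the transducer net, frame-wise CE on the new alignment

def get_network(epoch: int, **kwargs) -> dict:
    # Called for each epoch (epoch is 1-based).
    if epoch <= 20:
        return make_ctc_net()       # stage 1: CTC training
    return make_transducer_net()    # stage 2: after computing the new alignment
```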