returnn.frontend.decoder.transformer
(Label-sync) Transformer decoder, optionally including cross-attention to the encoder. Also see returnn.frontend.encoder.transformer.
References:
- The original Transformer paper (Attention Is All You Need)
- https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer
- https://github.com/pytorch-labs/gpt-fast
- https://github.com/karpathy/minGPT/blob/master/mingpt/model.py
- https://github.com/karpathy/nanoGPT/blob/master/model.py
- https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/transformer/transformer_decoder.py
- class returnn.frontend.decoder.transformer.TransformerDecoder(encoder_dim: ~returnn.tensor.dim.Dim | None, vocab_dim: ~returnn.tensor.dim.Dim, model_dim: ~returnn.tensor.dim.Dim | int = Dim{'transformer-dec-default-model-dim'(512)}, *, num_layers: int, ff: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, pos_enc: None | ~typing.Callable | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function sinusoidal_positional_encoding>, dropout: float = 0.1, num_heads: int = 8, att_dropout: float = 0.1, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>, decoder_layer: ~returnn.frontend.decoder.transformer.TransformerDecoderLayer | ~returnn.frontend.module.Module | type | ~typing.Any | None = None, decoder_layer_opts: ~typing.Dict[str, ~typing.Any] | None = None, embed_dim: ~returnn.tensor.dim.Dim | None = None, share_embedding: bool | None = None, input_embedding_scale: float | None = None, input_dropout: float | None = None, logits_with_bias: bool = False, sequential=<class 'returnn.frontend.container.Sequential'>)
Represents the Transformer decoder architecture. A construction sketch follows the parameter list below.
- Parameters:
encoder_dim – for cross-attention. None if no cross-attention.
vocab_dim
model_dim – the output feature dimension
num_layers – the number of decoder layers
ff – feed-forward / MLP block. Default is FeedForward
ff_dim – the dimension of the feed-forward layers. 2048 originally, or 4 times model_dim
ff_activation – activation function for feed-forward network
pos_enc – positional encoding. Default is sinusoidal positional encoding.
dropout – the dropout value for the FF block
num_heads – the number of attention heads
att_dropout – attention dropout value
norm – pre-normalization for FF and attention blocks
decoder_layer – an instance of TransformerDecoderLayer or similar
decoder_layer_opts – options for the decoder layer
embed_dim – if given, will first have an embedding [vocab,embed] and then a linear [embed,model].
share_embedding
input_embedding_scale
input_dropout
logits_with_bias
sequential
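A minimal construction sketch; the dim names and sizes below are made up, and the calling convention for running the decoder is not covered on this page:

    from returnn.tensor import Dim
    from returnn.frontend.decoder.transformer import TransformerDecoder

    # Hypothetical dims for illustration only.
    vocab_dim = Dim(10_025, name="vocab")  # target vocabulary
    enc_dim = Dim(512, name="enc")         # encoder output dim, enables cross-attention

    decoder = TransformerDecoder(
        encoder_dim=enc_dim,  # pass None instead for a decoder-only (LM-style) model
        vocab_dim=vocab_dim,
        model_dim=512,        # an int is accepted here per the signature
        num_layers=6,
        num_heads=8,
        dropout=0.1,
        att_dropout=0.1,
    )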
- class returnn.frontend.decoder.transformer.TransformerDecoderLayer(encoder_dim: ~returnn.tensor.dim.Dim | None, out_dim: ~returnn.tensor.dim.Dim = Dim{'transformer-dec-default-out-dim'(512)}, *, ff: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, num_heads: int = 8, self_att: ~returnn.frontend.attention.CausalSelfAttention | ~returnn.frontend.attention.RelPosCausalSelfAttention | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | None = None, self_att_opts: ~typing.Dict[str, ~typing.Any] | None = None, att_dropout: float = 0.1, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>)
Represents a single Transformer decoder layer (block)
- Parameters:
encoder_dim – for cross-attention. None if no cross-attention.
out_dim – the output feature dimension
ff – feed-forward / MLP block. Default is FeedForward
ff_dim – the dimension of feed-forward layers. 2048 originally, or 4 times out_dim
ff_activation – activation function for feed-forward network
dropout – the dropout value for the FF block
num_heads – the number of attention heads
self_att – the self-attention layer. CausalSelfAttention is both the original choice and the default
self_att_opts – options for the self-attention layer, e.g. for RelPosCausalSelfAttention
att_dropout – attention dropout value
norm – pre-normalization for FF and attention blocks
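A sketch of customizing the per-layer self-attention through TransformerDecoder's decoder_layer_opts; that the options are forwarded to this layer's constructor is an assumption based on the parameter descriptions above:

    from returnn.tensor import Dim
    from returnn.frontend.attention import RelPosCausalSelfAttention
    from returnn.frontend.decoder.transformer import TransformerDecoder

    vocab_dim = Dim(10_025, name="vocab")  # hypothetical vocabulary size

    lm = TransformerDecoder(
        encoder_dim=None,  # no cross-attention, i.e. a decoder-only (LM-style) stack
        vocab_dim=vocab_dim,
        num_layers=12,
        decoder_layer_opts=dict(
            # self_att accepts a class per the signature; presumably it is then
            # constructed with matching dims and attention options.
            self_att=RelPosCausalSelfAttention,
        ),
    )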
- class returnn.frontend.decoder.transformer.FeedForward(out_dim: ~returnn.tensor.dim.Dim, *, ff_dim: ~returnn.tensor.dim.Dim | int | None = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function relu>, with_bias: bool = True)
- Transformer position-wise feed-forward neural network layer: FF -> Activation -> Dropout -> FF
- Parameters:
out_dim – output feature dimension
ff_dim – dimension of the feed-forward layers
dropout – dropout value
activation – activation function, relu by default
with_bias – whether to use a bias in the linear layers. True by default for compatibility, but nowadays it is common to omit the bias.
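A small standalone construction sketch (hypothetical dims), grounded in the signature above:

    from returnn.tensor import Dim
    import returnn.frontend as rf
    from returnn.frontend.decoder.transformer import FeedForward

    model_dim = Dim(512, name="model")  # hypothetical model dim
    ff = FeedForward(out_dim=model_dim, ff_dim=2048, dropout=0.1, activation=rf.relu)
    # Per the description above, the block computes: FF -> Activation -> Dropout -> FF.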
- class returnn.frontend.decoder.transformer.FeedForwardGated(out_dim: ~returnn.tensor.dim.Dim, *, ff_dim: ~returnn.tensor.dim.Dim | int | None = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function silu>, gate_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function identity>, with_bias: bool = False)
Gated feed-forward block, computing f(Linear(x)) * Linear(x). With f = swish = silu this is SwiGLU, as defined in the paper GLU Variants Improve Transformer.
Alternative to FeedForward.
- Parameters:
out_dim
ff_dim – intermediate dimension. Unlike FeedForward: if not provided, uses the factor 4*2/3 to keep the same number of parameters as the original FeedForward, just as in the paper, and also makes it a multiple of 256.
dropout
activation – activation function for the gating. Unlike FeedForward, the default is swish (silu).
with_bias – whether to use a bias in the linear layers. Unlike FeedForward, the default is False.
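A sketch of plugging this block into the decoder via the ff option of TransformerDecoder (dims are hypothetical):

    from returnn.tensor import Dim
    from returnn.frontend.decoder.transformer import TransformerDecoder, FeedForwardGated

    vocab_dim = Dim(10_025, name="vocab")  # hypothetical vocabulary size

    decoder = TransformerDecoder(
        encoder_dim=None,
        vocab_dim=vocab_dim,
        num_layers=6,
        # ff accepts a type per the signature; ff_dim then follows
        # FeedForwardGated's own default rule (4*2/3, multiple of 256) described above.
        ff=FeedForwardGated,
    )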
- returnn.frontend.decoder.transformer.make_norm(norm: type | Dict[str, Any] | Module | Callable, out_dim: Dim) -> Module | Callable
- Parameters:
norm – norm type or dict or module or callable, e.g. rf.LayerNorm
out_dim – model/out dim
- Returns:
norm module or callable, e.g. rf.LayerNorm(out_dim)
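A small usage sketch based on the example in the docstring above:

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.decoder.transformer import make_norm

    model_dim = Dim(512, name="model")  # hypothetical
    norm = make_norm(rf.LayerNorm, model_dim)  # effectively rf.LayerNorm(model_dim)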