returnn.frontend.attention¶
Attention
- returnn.frontend.attention.dot_attention(query: Tensor, keys: Tensor, values: Tensor, *, key_dim: Dim, axis: Dim, att_dropout: float = 0.0, att_dropout_broadcast: bool | None = None) Tensor [source]¶
Calculates attention over the given axis, for the given key dim. Any other unrelated axes do not matter here. This can be used for multi-head or single-head attention. The query may or may not have additional dimensions.
- Parameters:
query – {…, key_dim}. For self-attention, do not use the same axis as in keys and values, but rather replace it by another new dim via replace_dim().
keys – {…, axis, key_dim}
values – {…, axis}
key_dim – dim in keys and query, to be reduced to calculate the attention energies.
axis – in keys and values, to apply attention over. The softmax is taken over this axis, and it is then reduced.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- Returns:
like values, but with axis removed, and possibly with additional axes from the query
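Example: a minimal usage sketch (not part of the API reference; it assumes the PyTorch backend via rf.select_backend_torch() and uses rf.random_normal() for dummy inputs; all dim sizes and names are made up):
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

batch = Dim(8, name="batch")
enc_time = Dim(10, name="enc_time")   # axis to attend over (in keys and values)
key_dim = Dim(64, name="key")         # reduced when computing the attention energies
value_dim = Dim(64, name="value")

# Dummy inputs; in practice these come from linear projections.
query = rf.random_normal([batch, key_dim])
keys = rf.random_normal([batch, enc_time, key_dim])
values = rf.random_normal([batch, enc_time, value_dim])

# Softmax over enc_time, which is then reduced; result dims: {batch, value_dim}.
att = rf.dot_attention(query, keys, values, key_dim=key_dim, axis=enc_time)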
- class returnn.frontend.attention.SelfAttentionBase(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Shared base class for (non-causal) self attention (SelfAttention) and causal self attention (CausalSelfAttention).
It uses dot_attention() for multi-headed dot-attention.
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- class returnn.frontend.attention.SelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Classic self attention on sequence level
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
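Example: a minimal usage sketch (same assumptions as above: PyTorch backend, rf.random_normal() for a dummy input, made-up dims; the call form self_att(x, axis=time) is also an assumption about the module API):
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

batch = Dim(8, name="batch")
time = Dim(20, name="time")
model_dim = Dim(256, name="model")

self_att = rf.SelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=Dim(256, name="key-total"),
    value_dim_total=Dim(256, name="value-total"),
    num_heads=4,
)

x = rf.random_normal([batch, time, model_dim])
# Attention over the time axis; output dims: {batch, time, model_dim} (proj_dim).
y = self_att(x, axis=time)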
- class returnn.frontend.attention.CausalSelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Classic causal self attention
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- default_initial_state(*, batch_dims: Sequence[Dim]) CausalSelfAttentionState [source]¶
For causal attention.
- class returnn.frontend.attention.CausalSelfAttentionState(*_args, k_accum: Tensor | None = None, v_accum: Tensor | None = None, accum_axis: Dim | None = None)[source]¶
State for CausalSelfAttention when used step-wise.
- Parameters:
k_accum – accumulated keys
v_accum – accumulated values
accum_axis – axis along which the keys and values are accumulated
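Example: a heavily hedged sketch of step-wise (frame-by-frame) usage. default_initial_state() is documented above; the exact step-wise call form (axis=single_step_dim, passing state, returning a tuple of output and updated state) is an assumption, not taken from this page:
import returnn.frontend as rf
from returnn.tensor import Dim, single_step_dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

batch = Dim(8, name="batch")
model_dim = Dim(256, name="model")

causal_att = rf.CausalSelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=Dim(256, name="key-total"),
    value_dim_total=Dim(256, name="value-total"),
    num_heads=4,
)

# Keep a CausalSelfAttentionState across decoder steps; it accumulates keys and values.
state = causal_att.default_initial_state(batch_dims=[batch])
x_step = rf.random_normal([batch, model_dim])  # one decoder frame
# Assumption: a step-wise call uses axis=single_step_dim and returns (output, new state).
y_step, state = causal_att(x_step, axis=single_step_dim, state=state)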
- class returnn.frontend.attention.RotaryPosSelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Rotary positional encoding (RoPE)-based self attention
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- class returnn.frontend.attention.RotaryPosCausalSelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Rotary positional encoding (RoPE)-based causal self attention
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- class returnn.frontend.attention.RelPosSelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, with_linear_pos: bool = True, with_pos_bias: bool = True, learnable_pos_emb: bool = False, learnable_pos_emb_clipping: int = 16, separate_pos_emb_per_head: bool = True, pos_emb_dropout: float = 0.0, att_dropout: float = 0.1)[source]¶
Self-attention with relative positional encoding. This covers both Shaw et al. self-att rel pos 2018 (https://arxiv.org/abs/1803.02155) and Dai et al. Transformer-XL style 2019 (https://arxiv.org/abs/1901.02860).
It uses relative_positional_encoding() or LearnedRelativePositionalEncoding.
To get Shaw et al. self-att rel pos 2018 / RETURNN SelfAttentionLayer + RelativePositionalEncodingLayer:
- with_bias = False (at least that was the RETURNN behavior)
- with_linear_pos = False
- with_pos_bias = False
- learnable_pos_emb = True
- separate_pos_emb_per_head = False (at least that was the RETURNN default)
To get Dai et al. Transformer-XL style 2019 (see the construction sketch after the parameter list):
- with_bias = False would be like the paper; however, in most implementations it is True (default)
- with_linear_pos = True (default)
- with_pos_bias = True (default)
- learnable_pos_emb = False (default)
- separate_pos_emb_per_head = True (default)
Further details: https://github.com/rwth-i6/returnn_common/wiki/Relative-positional-encoding
Code references, partly adapted from there: https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/embedding.py#L260 https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/model.py#L4
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
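Construction sketch for the two flag combinations listed above (only the constructor calls; dims and the number of heads are made up, and the backend selection is an assumption):
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

model_dim = Dim(256, name="model")

# Shaw et al. 2018 style (RETURNN SelfAttentionLayer + RelativePositionalEncodingLayer):
shaw_att = rf.RelPosSelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=Dim(256, name="key-total"),
    value_dim_total=Dim(256, name="value-total"),
    num_heads=4,
    with_bias=False,
    with_linear_pos=False,
    with_pos_bias=False,
    learnable_pos_emb=True,
    separate_pos_emb_per_head=False,
)

# Dai et al. 2019 Transformer-XL style: the defaults already match,
# so only the dims and the number of heads need to be given.
xl_att = rf.RelPosSelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=Dim(256, name="key-total"),
    value_dim_total=Dim(256, name="value-total"),
    num_heads=4,
)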
- class returnn.frontend.attention.RelPosCausalSelfAttention(in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, with_linear_pos: bool = True, with_pos_bias: bool = True, learnable_pos_emb: bool = False, learnable_pos_emb_clipping: int = 16, separate_pos_emb_per_head: bool = True, pos_emb_dropout: float = 0.0, att_dropout: float = 0.1)[source]¶
Self-attention with relative positional encoding. This covers both Shaw et al. self-att rel pos 2018 (https://arxiv.org/abs/1803.02155) and Dai et al. Transformer-XL style 2019 (https://arxiv.org/abs/1901.02860).
It uses relative_positional_encoding() or LearnedRelativePositionalEncoding.
Same defaults as RelPosSelfAttention, which is mostly Transformer-XL style.
Further details: https://github.com/rwth-i6/returnn_common/wiki/Relative-positional-encoding
Code references, partly adapted from there: https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/embedding.py#L260 https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/model.py#L4
- Parameters:
in_dim – input dim
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- class returnn.frontend.attention.CrossAttention(encoder_dim: Dim, query_in_dim: Dim, proj_dim: Dim | None, *, key_dim_total: Dim, value_dim_total: Dim, num_heads: int | Dim, with_bias: bool = True, att_dropout: float = 0.1, att_dropout_broadcast: bool | None = None)[source]¶
Cross attention
It uses dot_attention() for multi-headed dot-attention.
- Parameters:
encoder_dim – encoder output dim = input dim for key-value
query_in_dim – input dim for query
proj_dim – if given, will add a final linear projection to this dim. otherwise no projection after the attention
key_dim_total – total key dim. should be a multiple of num_heads
value_dim_total – total value dim. should be a multiple of num_heads
num_heads – number of heads
with_bias – whether to add bias to qkv and proj linear projections. Was False in original Transformer, but many recent implementations use True by default. Also see: https://github.com/rwth-i6/returnn_common/issues/234.
att_dropout – dropout for attention weights
att_dropout_broadcast – whether to broadcast over all but axis. Normally not wanted; disabled by default since behavior version 19.
- transform_encoder(encoder: Tensor, *, axis: Dim) State [source]¶
Transform the encoder output. This is intended as an initial API suggestion.
- forward_kv(source: Tensor) Tuple[Tensor, Tensor] [source]¶
This would be calculated once for the whole sequence (batch) and then always reused for attention().
- Returns:
k, v
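Example: a hedged sketch of the intended once-per-sequence key/value computation followed by per-step attention. transform_encoder() is documented above; calling the module as cross_att(query, enc_state) is an assumption about the API, not taken from this page:
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

batch = Dim(8, name="batch")
enc_time = Dim(20, name="enc_time")
enc_dim = Dim(256, name="enc")
dec_dim = Dim(256, name="dec")

cross_att = rf.CrossAttention(
    encoder_dim=enc_dim,
    query_in_dim=dec_dim,
    proj_dim=dec_dim,
    key_dim_total=Dim(256, name="key-total"),
    value_dim_total=Dim(256, name="value-total"),
    num_heads=4,
)

encoder_out = rf.random_normal([batch, enc_time, enc_dim])
# Keys/values are computed once per sequence...
enc_state = cross_att.transform_encoder(encoder_out, axis=enc_time)
# ...and then reused for every decoder step.
query = rf.random_normal([batch, dec_dim])  # one decoder frame
out = cross_att(query, enc_state)  # assumption: call takes (query, precomputed encoder state)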
- class returnn.frontend.attention.LearnedRelativePositionalEncoding(feat_dim: Dim, *, clipping: int = 16, dtype: str | None = None, causal: bool = False)[source]¶
Learnable relative positional encoding.
E.g. as used in Shaw et al., 2018 (https://arxiv.org/abs/1803.02155).
https://github.com/rwth-i6/returnn_common/wiki/Relative-positional-encoding
- Parameters:
feat_dim – feature dim, for the emb matrix and output
clipping – max distance to consider. emb matrix shape is [2 * clipping + 1, feat_dim] if not causal, else [clipping + 1, feat_dim]. The first and last frame will be the clipping frames.
dtype – for the emb matrix and output
- returnn.frontend.attention.relative_positional_encoding(*, query_spatial_dim: Dim, key_value_spatial_dim: Dim, feat_dim: Dim, query_offset: int = 0, dtype: str | None = None) Tuple[Tensor, Dim] [source]¶
Implements relative positional encoding, Transformer-XL style (https://arxiv.org/abs/1901.02860), as used for example by RelPosSelfAttention.
Code references, partly adapted from there: https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/embedding.py#L260 https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/model.py#L4
Note that this encoding is stored in a cache so that it is only calculated once and then reused.
Note that we could extend the implementation later to also buffer it even across mini-batches, like the ESPnet implementation does, e.g. by storing it in an auxiliary variable and increasing its size when needed. But this is not done yet, to keep the code simple.
- Returns:
tensor of shape [spatial_dim * 2 - 1, feat_dim], and the out spatial dim (spatial_dim * 2 - 1). In the center is the relative position i - j = 0; everything to the right is for i - j > 0, everything to the left for i - j < 0.
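Example: a minimal sketch for the self-attention case, where query and key/value share the same spatial dim (the backend selection and dims are assumptions):
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

time = Dim(10, name="time")
feat = Dim(64, name="feat")

# Self-attention case: query and key/value use the same spatial dim.
rel_pos_emb, rel_pos_dim = rf.relative_positional_encoding(
    query_spatial_dim=time, key_value_spatial_dim=time, feat_dim=feat
)
# rel_pos_emb has dims {rel_pos_dim, feat}, where rel_pos_dim covers the
# 2 * 10 - 1 = 19 relative positions, with i - j = 0 in the center.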
- returnn.frontend.attention.sinusoidal_positional_encoding(*, spatial_dim: Dim, feat_dim: Dim, offset: int | Tensor | None = None, base: int | float = 10000.0, dtype: str | None = None, device: str | None = None) Tensor [source]¶
Implements absolute sinusoidal positional encoding.
Code adapted from relative_positional_encoding() and our TF util get_positional_encoding().
Note that this encoding is stored in a cache so that it is only calculated once and then reused.
Note that we could extend the implementation later to also buffer it even across mini-batches, like the ESPnet implementation does, e.g. by storing it in an auxiliary variable and increasing its size when needed. But this is not done yet, to keep the code simple.
- Returns:
tensor of shape [spatial_dim, feat_dim] if spatial_dim != single_step_dim else [feat_dim]
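Example: a minimal sketch (the backend selection, dims, and the way the encoding is combined with the input are assumptions):
import returnn.frontend as rf
from returnn.tensor import Dim

rf.select_backend_torch()  # assumption: PyTorch backend is available

time = Dim(10, name="time")
feat = Dim(64, name="feat")

# Absolute sinusoidal positional encoding with dims {time, feat}.
pos_emb = rf.sinusoidal_positional_encoding(spatial_dim=time, feat_dim=feat)
# It can be added to an embedded input x with matching dims, e.g. x = x + pos_emb.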