Attention Layers#
Note that the more specific attention layers below are deprecated.
It is recommended to define the attention energy explicitly,
and then use returnn.tf.layers.rec.GenericAttentionLayer,
as in the sketch below.
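For example, a minimal config sketch of this pattern inside a RecLayer unit (the layer names "enc_ctx", "s_transformed" and "encoder" and the MLP-style energy are illustrative assumptions, not part of this API):

```python
# Sketch only: explicit attention energy, softmax over the encoder time axis,
# then GenericAttentionLayer to apply the weights on the encoder output.
# Layer names and dimensions are assumptions for illustration.
rec_unit = {
    "att_energy_in": {"class": "combine", "kind": "add",
                      "from": ["base:enc_ctx", "s_transformed"]},
    "energy_tanh": {"class": "activation", "activation": "tanh", "from": "att_energy_in"},
    "energy": {"class": "linear", "activation": None, "with_bias": False,
               "from": "energy_tanh", "n_out": 1},  # one scalar energy per encoder frame
    "att_weights": {"class": "softmax_over_spatial", "from": "energy"},  # does the masking
    "att": {"class": "generic_attention", "weights": "att_weights", "base": "base:encoder"},
}
```

Here SoftmaxOverSpatialLayer takes care of the masking over the encoder time axis, and GenericAttentionLayer reduces the base over that axis with the given weights.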
Generic Attention Layer#
- class returnn.tf.layers.rec.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)[source]#
The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer.
Note that we do not do any masking here; e.g. SoftmaxOverSpatialLayer does that.
Note that DotLayer is similar, just using a different terminology.
Reduce axis: weights: time-axis; base: time-axis.
Note that if the last layer was SoftmaxOverSpatialLayer, we should use the same time-axis, and we should also check whether these time axes really match.
Common axes (should match): batch-axis, all from base excluding the base feature axis and excluding the time axis.
Keep axes: base: feature axis; weights: all remaining, e.g. extra time.
- Parameters:
base (LayerBase) – encoder output to attend on; defines the output dim
weights (LayerBase) – the attention weights to apply on the base, e.g. from SoftmaxOverSpatialLayer
auto_squeeze (bool) – automatically squeeze away any weight axes of dimension 1
- classmethod transform_config_dict(d, network, get_layer)[source]#
- Parameters:
d (dict[str]) –
network (returnn.tf.network.TFNetwork) –
get_layer –
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
Self-Attention Layer#
- class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)[source]#
Applies self-attention on the input. I.e., with input x, it will basically calculate
att(Q x, K x, V x),
where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention is over the time dimension. If there is no time dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True. A config sketch follows the parameter list below.
- See also dot_product_attention here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
- Parameters:
num_heads (int) –
total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.
forward_weights_init (str) – see
returnn.tf.util.basic.get_initializer()
attention_dropout (float) –
attention_left_only (bool) – will mask out the future. see Attention is all you need.
initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
restrict_state_to_last_seq (bool) – see code comment below
state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.
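For illustration, a minimal network-dict sketch using this layer; layer names and dimensions are assumptions, and the key_shift entry is optional (shown here with a relative positional encoding layer):

```python
# Sketch only: causal multi-head self-attention with optional relative positional encoding.
# Layer names and dimensions are illustrative assumptions.
network = {
    # key_shift input; n_out matches key_dim = total_key_dim // num_heads = 512 // 8 = 64
    "rel_pos": {"class": "relative_positional_encoding", "from": "data", "n_out": 64},
    "self_att": {
        "class": "self_attention",
        "from": "data",
        "num_heads": 8,
        "total_key_dim": 512,
        "n_out": 512,                   # total value/output dim
        "attention_left_only": True,    # mask out the future (causal)
        "attention_dropout": 0.1,
        "key_shift": "rel_pos",         # optional additive term to the key
    },
}
```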
- classmethod transform_config_dict(d, network, get_layer)[source]#
- Parameters:
d (dict[str]) –
network (returnn.tf.network.TFNetwork) –
get_layer –
- classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]#
- Parameters:
n_out (int|NotSpecified) –
name (str) –
sources (list[LayerBase]) –
out_dim (Dim|NotSpecified) –
- Return type:
Data
- classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, num_heads, total_key_dim, name, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, initial_state=None, sources=(), **kwargs)[source]#
- Parameters:
batch_dim (tf.Tensor) –
network (returnn.tf.network.TFNetwork) –
num_heads (int) –
total_key_dim (int) –
out_dim (Dim) –
n_out (int) –
name (str) –
initial_state (str|float|int|None) –
sources (list[LayerBase]) –
- Return type:
dict[str, tf.Tensor]
- classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, sources, network, num_heads, total_key_dim, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]#
- Parameters:
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase|None) – for the scope
sources (list[LayerBase]) –
network (returnn.tf.network.TFNetwork) –
num_heads (int) –
total_key_dim (int) –
out_dim (Dim) –
n_out (int) –
- Return type:
dict[str, tf.TensorShape]
- post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)[source]#
- Parameters:
rec_vars_outputs (dict[str,tf.Tensor]) –
seq_len (tf.Tensor) – shape (batch,)
- Return type:
dict[str,tf.Tensor]
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
Concatenative Attention Layer#
Deprecated
- class returnn.tf.layers.rec.ConcatAttentionLayer(**kwargs)[source]#
Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared it to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.
- Parameters:
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
Dot-Product Attention Layer#
Deprecated
- class returnn.tf.layers.rec.DotAttentionLayer(energy_factor=None, **kwargs)[source]#
Classic global attention: Dot-product as similarity measure between base_ctx and source.
- Parameters:
base (LayerBase) – encoder output to attend on. defines output-dim
base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data; its dim must be equal to that of input-data
energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
Gauss Window Attention Layer#
Deprecated
- class returnn.tf.layers.rec.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]#
Interprets the incoming source as the location (float32, shape (batch,)) and returns a Gaussian window weighting of the base around that location. The window size is fixed (TODO: but the variance can optionally be dynamic).
- Parameters:
window_size (int) – the window size where the Gaussian window will be applied on the base
std (float) – standard deviation for Gauss
inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted around by multiples of inner_size_step; e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
inner_size_step (float) – see inner_size above
- classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]#
- Parameters:
inner_size (int|None) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#