Attention Layers
Note that the more specific attention layers are deprecated.
It is recommended to define the attention energy explicitly
and then use returnn.tf.layers.rec.GenericAttentionLayer.
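As a rough illustration of this recommendation, a network config might chain an energy layer, SoftmaxOverSpatialLayer and GenericAttentionLayer as sketched below. The layer names ("encoder", "energy", "att_weights", "att") are hypothetical, "encoder" is assumed to be defined elsewhere in the network, and the class strings as well as the choice of a linear layer for the energy are assumptions that are not stated on this page. Check them against your RETURNN version.

```python
# Sketch of a RETURNN network dict; all layer names here are hypothetical.
network = {
    # Assumed energy computation: one scalar energy per encoder frame.
    "energy": {"class": "linear", "activation": None, "n_out": 1, "from": "encoder"},
    # Normalizes the energies over the time axis (this layer also does the masking).
    "att_weights": {"class": "softmax_over_spatial", "from": "energy"},
    # Weighted sum of the base over the time axis, using the explicit weights.
    "att": {"class": "generic_attention", "weights": "att_weights", "base": "encoder"},
}
```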
Generic Attention Layer
- class returnn.tf.layers.rec.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)
The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer. Note that we do not do any masking here. E.g. SoftmaxOverSpatialLayer does that.
Note that DotLayer is similar, just using a different terminology. Reduce axis: weights: time-axis; base: time-axis.
Note that if the last layer was SoftmaxOverSpatialLayer, we should use the same time-axis. Also, we should check whether these time axes really match.
Common axes (should match): batch-axis, all from base excluding base feature axis and excluding time axis. Keep axes: base: feature axis; weights: all remaining, e.g. extra time.
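The reduction described above (reduce the time axis of both weights and base, keep the feature axis of the base) can be sketched in plain Python. This is only an illustration of the arithmetic on nested lists, not the layer's TensorFlow implementation. The function name generic_attention and the fixed (batch, time, feature) layout are assumptions made for the example.

```python
def generic_attention(weights, base):
    """Weighted sum of base over time.

    weights: [batch][time], base: [batch][time][feature] -> [batch][feature].
    No masking is done here; the weights are expected to be masked already.
    """
    out = []
    for w_seq, b_seq in zip(weights, base):
        # The time axes of weights and base have to match.
        if len(w_seq) != len(b_seq):
            raise ValueError("time axes of weights and base do not match")
        n_feat = len(b_seq[0])
        out.append(
            [sum(w * frame[f] for w, frame in zip(w_seq, b_seq)) for f in range(n_feat)]
        )
    return out


weights = [[0.25, 0.75]]             # batch=1, time=2
base = [[[1.0, 2.0], [3.0, 6.0]]]    # batch=1, time=2, feature=2
context = generic_attention(weights, base)  # batch=1, feature=2
```

Here the time axis is reduced and only the batch and feature axes remain in context.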
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- output_before_activation: OutputWithActivation | None
- search_choices: SearchChoices | None
Self-Attention Layer
- class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)
Applies self-attention on the input. I.e., with input x, it will basically calculate
att(Q x, K x, V x),
where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time-dimension. If there is no time-dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True.
See also dot_product_attention here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
- Parameters:
num_heads (int)
total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
attention_dropout (float)
attention_left_only (bool) – will mask out the future. see Attention is all you need.
initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
restrict_state_to_last_seq (bool) – see code comment below
state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.
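To make the roles of total_key_dim, num_heads and attention_left_only more concrete, here is a minimal plain-Python sketch of dot attention for a single head. It is not the layer's implementation: the function names, the toy inputs, and the 1/sqrt(key_dim) scaling are assumptions made for illustration, and the projections Q x, K x, V x are taken as given.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(q, k, v, left_only=False):
    """Single-head dot attention over time.

    q, k: [time][key_dim], v: [time][value_dim]; returns [time][value_dim].
    """
    scale = 1.0 / math.sqrt(len(k[0]))  # assumed scaling, not stated in the text above
    out = []
    for i, q_i in enumerate(q):
        # With left_only, query i may only attend to keys 0..i (no future frames).
        n_keys = i + 1 if left_only else len(k)
        energies = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k[:n_keys]]
        weights = softmax(energies)
        out.append(
            [sum(w * v_j[f] for w, v_j in zip(weights, v)) for f in range(len(v[0]))]
        )
    return out


num_heads, total_key_dim = 8, 64
key_dim = total_key_dim // num_heads  # per-head key dimension

# Toy inputs standing in for the projected Q x, K x, V x of one head.
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
causal = dot_attention(q, k, v, left_only=True)
full = dot_attention(q, k, v, left_only=False)
```

With left_only=True, the first frame can only attend to itself, so its output is exactly the first value vector. Without the mask it also mixes in the second frame.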
- classmethod transform_config_dict(d, network, get_layer)
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=<class 'returnn.util.basic.NotSpecified'>, **kwargs)
- Parameters:
n_out (int|NotSpecified)
name (str)
sources (list[LayerBase])
out_dim (Dim|NotSpecified)
- Return type:
Data
- classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, num_heads, total_key_dim, name, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, initial_state=None, sources=(), **kwargs)
- Parameters:
batch_dim (tf.Tensor)
network (returnn.tf.network.TFNetwork)
num_heads (int)
total_key_dim (int)
out_dim (Dim)
n_out (int)
name (str)
initial_state (str|float|int|None)
sources (list[LayerBase])
- Return type:
dict[str, tf.Tensor]
- classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, sources, network, num_heads, total_key_dim, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, **kwargs)
- Parameters:
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase|None) – for the scope
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
num_heads (int)
total_key_dim (int)
out_dim (Dim)
n_out (int)
- Return type:
dict[str, tf.TensorShape]
- post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)
- Parameters:
rec_vars_outputs (dict[str,tf.Tensor])
seq_len (tf.Tensor) – shape (batch,)
- Return type:
dict[str,tf.Tensor]
- output_before_activation: OutputWithActivation | None
- search_choices: SearchChoices | None
Concatenative Attention Layer
Deprecated
- class returnn.tf.layers.rec.ConcatAttentionLayer(**kwargs)
Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared this to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.
- Parameters:
- output_before_activation: OutputWithActivation | None
- search_choices: SearchChoices | None
Dot-Product Attention Layer
Deprecated
- class returnn.tf.layers.rec.DotAttentionLayer(energy_factor=None, **kwargs)
Classic global attention: Dot-product as similarity measure between base_ctx and source.
- Parameters:
base (LayerBase) – encoder output to attend on. defines output-dim
base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data
energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
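The temperature-like effect of energy_factor can be illustrated with a small plain-Python sketch. It is not the layer's code: the function name attention_weights and the toy vectors are hypothetical, and the sketch only assumes that the energy is the dot product between source and each base_ctx frame, scaled by energy_factor before the softmax.

```python
import math


def attention_weights(source, base_ctx, energy_factor=None):
    """Softmax over dot-product energies between source and each base_ctx frame."""
    energies = [sum(s * c for s, c in zip(source, ctx_t)) for ctx_t in base_ctx]
    if energy_factor is not None:
        # Scaling the energies acts like a softmax temperature.
        energies = [energy_factor * e for e in energies]
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


source = [2.0, 0.0, 0.0, 0.0]
base_ctx = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]  # two encoder frames, dim 4

unscaled = attention_weights(source, base_ctx)
scaled = attention_weights(source, base_ctx, energy_factor=1.0 / math.sqrt(4))
```

A factor below 1 shrinks the energies, so the scaled weights are flatter than the unscaled ones while still favouring the matching frame.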
- output_before_activation: OutputWithActivation | None
- search_choices: SearchChoices | None
Gauss Window Attention Layer
Deprecated
- class returnn.tf.layers.rec.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)
Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).
- Parameters:
window_size (int) – the window size where the Gaussian window will be applied on the base
std (float) – standard deviation for Gauss
inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
inner_size_step (float) – see inner_size above
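The interplay of window_size, std, inner_size and inner_size_step can be sketched as follows. This is a hedged plain-Python illustration, not the layer's implementation: both helper functions are hypothetical, the centering of the shifted locations is chosen so that it reproduces the [t-1,t-0.5,t,t+0.5,t+1] example above, and the unnormalized Gaussian and the placement of the window around the rounded location are assumptions.

```python
import math


def inner_locations(t, inner_size, inner_size_step):
    """Locations shifted symmetrically around t in steps of inner_size_step."""
    center = (inner_size - 1) / 2.0
    return [t + (i - center) * inner_size_step for i in range(inner_size)]


def gauss_window_weights(location, window_size, std=1.0):
    """Unnormalized Gaussian weights for window_size integer positions around location."""
    start = int(round(location)) - window_size // 2
    positions = list(range(start, start + window_size))
    weights = [math.exp(-0.5 * ((p - location) / std) ** 2) for p in positions]
    return positions, weights


locations = inner_locations(3.0, inner_size=5, inner_size_step=0.5)
positions, weights = gauss_window_weights(3.0, window_size=5, std=1.0)
```

Each shifted location would get its own window over the base, which is where the additional output dimension of size inner_size comes from.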
- classmethod get_out_data_from_opts(inner_size=None, **kwargs)
- Parameters:
inner_size (int|None)
- Return type:
Data
- output_before_activation: OutputWithActivation | None
- search_choices: SearchChoices | None