Attention Layers

Note that the more specific attention layers are deprecated. It is recommended to define the attention energy explicitly and then use returnn.tf.layers.rec.GenericAttentionLayer.

Generic Attention Layer

class returnn.tf.layers.rec.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer. Note that no masking is done here; a layer such as SoftmaxOverSpatialLayer is expected to do that.

Note that DotLayer is similar, just using a different terminology. Reduce axis: weights: time-axis; base: time-axis.

Note that if the last layer was SoftmaxOverSpatialLayer, we should use the same time-axis. Also we should do a check whether these time axes really match.

Common axes (should match): batch-axis, all from base excluding base feature axis and excluding time axis. Keep axes: base: feature axis; weights: all remaining, e.g. extra time.
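As a rough illustration of the axis handling described above, the following plain-Python sketch reduces the time axis for a single sequence without a batch axis: base has shape (enc-time, n_out), weights has shape (enc-time,), and the result has shape (n_out,). The function name weighted_sum_over_time and the list-based representation are illustrative assumptions, not part of RETURNN.

```python
def weighted_sum_over_time(base, weights):
    """Weighted sum of feature vectors over the time axis.

    base: list of T feature vectors, each a list of n_out floats.
    weights: list of T floats (e.g. the output of a softmax over time).
    """
    if len(base) != len(weights):
        # The time axes of base and weights are reduced together, so they must match.
        raise ValueError("time axes of base and weights must match")
    n_out = len(base[0])
    out = [0.0] * n_out
    for frame, w in zip(base, weights):
        for i in range(n_out):
            out[i] += w * frame[i]
    return out


base = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]  # (enc-time=3, n_out=2)
weights = [0.5, 0.25, 0.25]                  # (enc-time=3,)
context = weighted_sum_over_time(base, weights)  # (n_out=2,)
```

The feature axis of base is kept, and the shared time axis disappears from the output.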

Parameters:

base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (…) + (n_out,)
weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()
auto_squeeze (bool) – auto-squeeze any weight-axes with dim=1 away
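A minimal sketch of how these parameters might be used in a network config, assuming the layers live inside the unit of a decoder rec layer. All layer names (s, enc_ctx, enc_value and so on) and dimensions are hypothetical; the energy is computed explicitly with tanh-additive layers, normalized with softmax_over_spatial, and then passed to generic_attention as weights.

```python
# Sketch of the attention part of a decoder rec-layer unit.
# Layer names and dims are hypothetical; "base:..." refers to layers outside the rec unit.
att_unit = {
    # Project the decoder state "s" to the dim of the encoder context.
    "s_transformed": {"class": "linear", "activation": None, "with_bias": False,
                      "from": ["s"], "n_out": 512},
    # Explicit attention energy: v^T tanh(enc_ctx + s_transformed).
    "energy_in": {"class": "combine", "kind": "add",
                  "from": ["base:enc_ctx", "s_transformed"], "n_out": 512},
    "energy_tanh": {"class": "activation", "activation": "tanh", "from": ["energy_in"]},
    "energy": {"class": "linear", "activation": None, "with_bias": False,
               "from": ["energy_tanh"], "n_out": 1},
    # Softmax over the encoder time axis; this layer also takes care of masking.
    "att_weights": {"class": "softmax_over_spatial", "from": ["energy"]},
    # Weighted sum of the base over the encoder time axis.
    # The weights have a trailing dim of 1, which auto_squeeze removes.
    "att": {"class": "generic_attention", "weights": "att_weights",
            "base": "base:enc_value", "auto_squeeze": True},
}
```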

layer_class: Optional[str] = 'generic_attention'

recurrent = True

base_weights: Optional[tf.Tensor]

get_dep_layers()

Return type: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(base, weights, auto_squeeze=True, sources=(), **kwargs)

Parameters:

base (LayerBase)
weights (LayerBase)
auto_squeeze (bool)
sources (list[LayerBase]) – ignored, should be empty (checked in __init__)

Return type:

Data

input_data: Data | None

kwargs: Dict[str] | None

output_before_activation: OutputWithActivation | None

output_loss: tf.Tensor | None

rec_vars_outputs: Dict[str, tf.Tensor]

search_choices: SearchChoices | None

params: Dict[str, tf.Variable]

saveable_param_replace: Dict[tf.Variable, 'BaseSaverBuilder.SaveableObject' | None]

stats: Dict[str, tf.Tensor]

Self-Attention Layer

class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is (for now) multi-head dot-attention and Q, K, V are matrices. The attention is over the time dimension. If there is no time dimension, the layer is expected to be inside a RecLayer; this case is only valid with attention_left_only=True.

See also dot_product_attention here: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
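The computation described above can be sketched in plain Python for a single head and a single sequence. This is only an illustration under stated assumptions: queries, keys and values stand for the already projected inputs Q x, K x, V x; the 1/sqrt(key_dim) scaling follows "Attention is all you need" and is an assumption about this layer; the left_only flag mimics attention_left_only. A multi-head version would split total_key_dim into num_heads parts and run this per head.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(queries, keys, values, left_only=False):
    """Single-head dot-attention over the time axis.

    queries, keys: lists of T vectors of size key_dim.
    values: list of T vectors of size value_dim.
    left_only: if True, position t only attends to positions <= t.
    """
    key_dim = len(keys[0])
    scale = 1.0 / math.sqrt(key_dim)  # assumed scaling, see lead-in
    outputs = []
    for t, q in enumerate(queries):
        num_visible = t + 1 if left_only else len(keys)
        energies = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[:num_visible]]
        weights = softmax(energies)
        # zip stops after num_visible entries, so masked positions contribute nothing.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


x = [[1.0, 0.0], [0.0, 1.0]]  # T=2, dim=2; Q, K, V taken as identity here
causal = dot_attention(x, x, x, left_only=True)
full = dot_attention(x, x, x)
```

With left_only=True the first position can only see itself, so its output equals the first value vector.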

Parameters:

num_heads (int)
total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
attention_dropout (float)
attention_left_only (bool) – will mask out the future. see Attention is all you need.
initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
restrict_state_to_last_seq (bool) – see code comment below
state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.
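A hedged example of how these options might appear in a network config. The layer name enc_self_att, the source layer data, and all dimensions are hypothetical choices for illustration.

```python
# Hypothetical network fragment using the self_attention layer class.
network = {
    "enc_self_att": {
        "class": "self_attention",
        "from": ["data"],
        "num_heads": 8,
        "total_key_dim": 512,
        "n_out": 512,
        "attention_dropout": 0.1,
        "attention_left_only": False,  # set True to mask out the future
    },
}

# Per-head key dim, as stated in the total_key_dim description.
opts = network["enc_self_att"]
key_dim = opts["total_key_dim"] // opts["num_heads"]
```

Choosing total_key_dim as a multiple of num_heads keeps the integer division exact.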

layer_class: Optional[str] = 'self_attention'

recurrent = True

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, n_out=NotSpecified, out_dim=NotSpecified, **kwargs)

Parameters:

n_out (int|NotSpecified)
name (str)
sources (list[LayerBase])
out_dim (Dim|NotSpecified)

Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, num_heads, total_key_dim, name, out_dim=NotSpecified, n_out=NotSpecified, initial_state=None, sources=(), **kwargs)

Parameters:

batch_dim (tf.Tensor)
rec_layer (RecLayer|LayerBase)
network (returnn.tf.network.TFNetwork)
num_heads (int)
total_key_dim (int)
out_dim (Dim)
n_out (int)
name (str)
initial_state (str|float|int|None)
sources (list[LayerBase])

Return type:

dict[str, tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, sources, network, num_heads, total_key_dim, out_dim=NotSpecified, n_out=NotSpecified, **kwargs)

Parameters:

rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase|None) – for the scope
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
num_heads (int)
total_key_dim (int)
out_dim (Dim)
n_out (int)

Return type:

dict[str, tf.TensorShape]

post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)

Parameters:

rec_vars_outputs (dict[str,tf.Tensor])
seq_len (tf.Tensor) – shape (batch,)

Return type:

dict[str,tf.Tensor]

input_data: Data | None

kwargs: Dict[str] | None

output_before_activation: OutputWithActivation | None

output_loss: tf.Tensor | None

rec_vars_outputs: Dict[str, tf.Tensor]

search_choices: SearchChoices | None

params: Dict[str, tf.Variable]

saveable_param_replace: Dict[tf.Variable, 'BaseSaverBuilder.SaveableObject' | None]

stats: Dict[str, tf.Tensor]

Concatenative Attention Layer

Deprecated

class returnn.tf.layers.rec.ConcatAttentionLayer(**kwargs)

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is the variant used by the Montreal group (Bahdanau et al.), whereas the Stanford group (Luong et al.) compared it to dot-attention. Concat-attention was arguably the more standard choice for machine translation at the time of writing.

Parameters:

base (LayerBase) – encoder output to attend on
base_ctx (LayerBase) – encoder output used to calculate the attention weights
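To make the similarity measure concrete, here is a plain-Python sketch of a tanh-additive energy for one sequence. It assumes that base_ctx and source have already been projected to the same inner dimension and that a learned vector v reduces the tanh output to a scalar per time step; the names additive_energies and v are illustrative, and the deprecated layer's exact parameterization may differ.

```python
import math


def additive_energies(base_ctx, source, v):
    """Energy per encoder time step: sum_i v[i] * tanh(base_ctx[t][i] + source[i]).

    base_ctx: list of T vectors of size inner_dim (encoder side).
    source: one vector of size inner_dim (decoder side, already projected).
    v: weight vector of size inner_dim (assumed learned parameter).
    """
    return [
        sum(v_i * math.tanh(c_i + s_i) for v_i, c_i, s_i in zip(v, ctx_t, source))
        for ctx_t in base_ctx
    ]


base_ctx = [[0.0, 0.0], [1.0, -1.0]]
source = [0.0, 0.0]
v = [1.0, 1.0]
energies = additive_energies(base_ctx, source, v)
```

A softmax over these energies would give the attention weights used to sum up base.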

layer_class: Optional[str] = 'concat_attention'

base_weights: Optional[tf.Tensor]

input_data: Data | None

kwargs: Dict[str] | None

output_before_activation: OutputWithActivation | None

output_loss: tf.Tensor | None

rec_vars_outputs: Dict[str, tf.Tensor]

search_choices: SearchChoices | None

params: Dict[str, tf.Variable]

saveable_param_replace: Dict[tf.Variable, 'BaseSaverBuilder.SaveableObject' | None]

stats: Dict[str, tf.Tensor]

Dot-Product Attention Layer

Deprecated

class returnn.tf.layers.rec.DotAttentionLayer(energy_factor=None, **kwargs)

Classic global attention: Dot-product as similarity measure between base_ctx and source.

Parameters:

base (LayerBase) – encoder output to attend on. defines output-dim
base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data
energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
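The role of energy_factor can be sketched as follows for a single sequence. The helper name dot_energies is hypothetical; the code only illustrates the dot-product between each base_ctx frame and the source vector, optionally scaled before a softmax would be applied.

```python
def dot_energies(base_ctx, source, energy_factor=None):
    """Dot-product energy per encoder time step, optionally scaled.

    base_ctx: list of T vectors; each must have the same dim as source.
    source: one vector (the input data at the current decoder step).
    energy_factor: optional scale, e.g. 1/sqrt(dim) as in "Attention is all you need".
    """
    energies = [sum(c * s for c, s in zip(ctx_t, source)) for ctx_t in base_ctx]
    if energy_factor is not None:
        energies = [energy_factor * e for e in energies]
    return energies


base_ctx = [[1.0, 2.0], [3.0, 4.0]]
source = [1.0, 1.0]
unscaled = dot_energies(base_ctx, source)
scaled = dot_energies(base_ctx, source, energy_factor=0.5)
```

A smaller energy_factor flattens the softmax distribution, and a larger one sharpens it.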

layer_class: Optional[str] = 'dot_attention'

base_weights: Optional[tf.Tensor]

input_data: Data | None

kwargs: Dict[str] | None

output_before_activation: OutputWithActivation | None

output_loss: tf.Tensor | None

rec_vars_outputs: Dict[str, tf.Tensor]

search_choices: SearchChoices | None

params: Dict[str, tf.Variable]

saveable_param_replace: Dict[tf.Variable, 'BaseSaverBuilder.SaveableObject' | None]

stats: Dict[str, tf.Tensor]

Gauss Window Attention Layer

Deprecated

class returnn.tf.layers.rec.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

Parameters:

window_size (int) – the window size where the Gaussian window will be applied on the base
std (float) – standard deviation for Gauss
inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
inner_size_step (float) – see inner_size above
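The following plain-Python sketch illustrates the two ideas in these parameters: a fixed-size Gaussian window around a float location, and the shifted locations produced by inner_size and inner_size_step. How the window is centered and whether the weights are normalized are assumptions made for this example; the function names are hypothetical.

```python
import math


def gauss_window_weights(location, window_size, std=1.0):
    """Unnormalized Gaussian weights for window_size integer positions around location."""
    # Assumption: the window is centered on the rounded location.
    start = int(round(location)) - window_size // 2
    positions = [start + i for i in range(window_size)]
    weights = [math.exp(-0.5 * ((p - location) / std) ** 2) for p in positions]
    return positions, weights


def shifted_locations(t, inner_size, inner_size_step=0.5):
    """Locations shifted by +/- inner_size_step around t, inner_size entries in total."""
    half = inner_size // 2
    return [t + (i - half) * inner_size_step for i in range(inner_size)]


positions, weights = gauss_window_weights(location=3.0, window_size=3)
locations = shifted_locations(3.0, inner_size=5, inner_size_step=0.5)
```

With t=3.0 this reproduces the pattern [t-1, t-0.5, t, t+0.5, t+1] from the inner_size description.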

layer_class: Optional[str] = 'gauss_window_attention'

classmethod get_out_data_from_opts(inner_size=None, **kwargs)

Parameters: inner_size (int|None)
Return type: Data

base_weights: Optional[tf.Tensor]

input_data: Data | None

kwargs: Dict[str] | None

output_before_activation: OutputWithActivation | None

output_loss: tf.Tensor | None

rec_vars_outputs: Dict[str, tf.Tensor]

search_choices: SearchChoices | None

params: Dict[str, tf.Variable]

saveable_param_replace: Dict[tf.Variable, 'BaseSaverBuilder.SaveableObject' | None]

stats: Dict[str, tf.Tensor]