returnn.frontend.encoder.e_branchformer

E-Branchformer (https://arxiv.org/pdf/2210.00077)

Example usage:

import returnn.frontend as rf
from returnn.frontend.encoder.conformer import ConformerEncoder
from returnn.frontend.encoder.e_branchformer import EBranchformerLayer

model = ConformerEncoder(
    out_dim=...,  # model dim, output_size in ESPnet
    num_layers=...,  # num_blocks in ESPnet
    encoder_layer=rf.build_dict(
        EBranchformerLayer,
        ff_dim=...,  # linear_units in ESPnet
        num_heads=...,  # attention_heads in ESPnet
        cgmlp_ff_dim=...,  # half of cgmlp_linear_units in ESPnet
        cgmlp_conv_kernel=...,  # cgmlp_conv_kernel in ESPnet
        merge_conv_kernel=...,  # merge_conv_kernel in ESPnet
    ),
)
class returnn.frontend.encoder.e_branchformer.EBranchformerLayer(out_dim: ~returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, ff: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, num_heads: int = 4, self_att: ~returnn.frontend.attention.RelPosSelfAttention | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any = <class 'returnn.util.basic.NotSpecified'>, att_dropout: float = 0.1, cgmlp: type | ~typing.Dict[str, ~typing.Any] = <class 'returnn.util.basic.NotSpecified'>, cgmlp_ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, cgmlp_conv_kernel: int = 31, merge_conv_kernel: int = 3, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>)[source]

E-Branchformer layer, e.g. to be used in the returnn.frontend.encoder.conformer.ConformerEncoder.

See the module docstring returnn.frontend.encoder.e_branchformer for an example.

By convention, any options to the module are passed to __init__, and potentially changing inputs (other tensors) are passed to __call__().
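
For illustration, the layer can also be constructed directly. The concrete numbers below are only a hypothetical example (roughly corresponding to an ESPnet config with output_size=256, attention_heads=4, linear_units=1024, cgmlp_linear_units=2048); all other options keep their defaults:

from returnn.tensor import Dim
from returnn.frontend.encoder.e_branchformer import EBranchformerLayer

enc_dim = Dim(256, name="enc")  # model dim (output_size in ESPnet)
layer = EBranchformerLayer(
    enc_dim,
    ff_dim=1024,  # linear_units in ESPnet
    num_heads=4,  # attention_heads in ESPnet
    cgmlp_ff_dim=1024,  # cgmlp_linear_units (2048) / 2 in ESPnet
    cgmlp_conv_kernel=31,  # cgmlp_conv_kernel in ESPnet
    merge_conv_kernel=3,  # merge_conv_kernel in ESPnet
)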

class returnn.frontend.encoder.e_branchformer.FeedForwardConvGated(out_dim: ~returnn.tensor.dim.Dim, *, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, kernel_size: int = 31, dropout: float = 0.1, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function gelu>, gate_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function identity>, with_bias: bool = True, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>)[source]

Convolutional Gating MLP (cgMLP) as introduced in https://openreview.net/forum?id=RA-zVvZLYIy and then used by the E-Branchformer model (https://arxiv.org/pdf/2210.00077). It uses the Convolutional Spatial Gating Unit (CSGU). This is the local extractor branch in the E-Branchformer model.

Related is the returnn.frontend.decoder.transformer.FeedForwardGated module.

Parameters:
  • out_dim – the encoder (e.g. E-Branchformer) model dim (usually 256 or 512).

  • ff_dim – intermediate dimension. This corresponds to cgmlp_linear_units/2 in ESPnet. Note the factor 1/2: in ESPnet, you specify the total dimension before it is split for the gating, while here you specify the dimension of the gating part. Common settings are 2048/2 or 3072/2. In the paper, a factor of 3 of the model dimension is mentioned (i.e. a factor of 6 in the ESPnet setting). See the sketch after this parameter list for how the split is used.

  • kernel_size – for the depthwise convolution (usually 31)

  • dropout

  • activation – activation function after the first linear layer, applied to both parts. Default as in the paper: gelu. Note: in returnn.frontend.decoder.transformer.FeedForwardGated, the activation arg corresponds to gate_activation here.

  • gate_activation – activation function for the gate part, applied before the gating (multiplication). Default as in the paper: identity. Note: in returnn.frontend.decoder.transformer.FeedForwardGated, the activation arg corresponds to gate_activation here.

  • with_bias – whether to use bias in the linear layers and the conv layer. Default as in the paper: True.

  • norm – normalization layer. Default as in the paper: LayerNorm.
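
As a rough illustration of how ff_dim and the split for the gating interact, here is a hedged PyTorch-style sketch of the cgMLP / CSGU dataflow. This is not the RETURNN implementation; the sizes, variable names, and padding handling are only illustrative, and dropout is omitted:

import torch

d_model, d_ff, kernel = 512, 1024, 31  # out_dim, ff_dim (already the per-half size), kernel_size
up = torch.nn.Linear(d_model, 2 * d_ff)  # first linear: project to 2 * ff_dim
norm = torch.nn.LayerNorm(d_ff)  # norm on the gate half
dw_conv = torch.nn.Conv1d(d_ff, d_ff, kernel, padding=kernel // 2, groups=d_ff)  # depthwise conv
down = torch.nn.Linear(d_ff, d_model)  # final projection back to the model dim

x = torch.randn(8, 100, d_model)  # (batch, time, feature)
h = torch.nn.functional.gelu(up(x))  # activation after the first linear, for both parts
a, g = h.chunk(2, dim=-1)  # split into content part a and gate part g, each of size ff_dim
g = dw_conv(norm(g).transpose(1, 2)).transpose(1, 2)  # CSGU: norm + depthwise conv on the gate part
# gate_activation (default: identity) would be applied to g here
y = down(a * g)  # gating by elementwise product, then project back to the model dim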

class returnn.frontend.encoder.e_branchformer.Merge(*, in_dim1: Dim, in_dim2: Dim, out_dim: Dim, merge_conv_kernel: int = 3)[source]

The merge module from the E-Branchformer model.

By convention, any options to the module are passed to __init__, and potentially changing inputs (other tensors) are passed to __call__().
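
The merge module combines the outputs of the global (self-attention) branch and the local (cgMLP) branch: following the paper, the two outputs are concatenated, a depthwise convolution (merge_conv_kernel) with a residual connection is applied, and a linear projection maps back to the model dim. Below is a hedged PyTorch-style sketch of that computation, not the RETURNN module's internals; names and sizes are illustrative:

import torch

d_model, merge_kernel = 512, 3
dw_conv = torch.nn.Conv1d(2 * d_model, 2 * d_model, merge_kernel, padding=merge_kernel // 2, groups=2 * d_model)
proj = torch.nn.Linear(2 * d_model, d_model)

x_att = torch.randn(8, 100, d_model)  # output of the global (self-attention) branch
x_cgmlp = torch.randn(8, 100, d_model)  # output of the local (cgMLP) branch
cat = torch.cat([x_att, x_cgmlp], dim=-1)  # concat along the feature dim
merged = cat + dw_conv(cat.transpose(1, 2)).transpose(1, 2)  # depthwise conv + residual
out = proj(merged)  # project back to the model dim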