returnn.frontend.encoder.e_branchformer
E-Branchformer (https://arxiv.org/pdf/2210.00077)
Example usage:

    import returnn.frontend as rf
    from returnn.frontend.encoder.conformer import ConformerEncoder
    from returnn.frontend.encoder.e_branchformer import EBranchformerLayer

    model = ConformerEncoder(
        out_dim=...,  # model dim, output_size in ESPnet
        num_layers=...,  # num_blocks in ESPnet
        encoder_layer=rf.build_dict(
            EBranchformerLayer,
            ff_dim=...,  # linear_units in ESPnet
            num_heads=...,  # attention_heads in ESPnet
            cgmlp_ff_dim=...,  # half of cgmlp_linear_units in ESPnet
            cgmlp_conv_kernel=...,  # cgmlp_conv_kernel in ESPnet
            merge_conv_kernel=...,  # merge_conv_kernel in ESPnet
        ),
    )
- class returnn.frontend.encoder.e_branchformer.EBranchformerLayer(out_dim: Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, ff: type | Dict[str, Any] | Module = NotSpecified, ff_dim: Dim | int = NotSpecified, ff_activation: Callable[[Tensor], Tensor] | Dict[str, Any] | Module = NotSpecified, dropout: float = 0.1, num_heads: int = 4, self_att: RelPosSelfAttention | Module | type | Dict[str, Any] | Any = NotSpecified, att_dropout: float = 0.1, cgmlp: type | Dict[str, Any] = NotSpecified, cgmlp_ff_dim: Dim | int = NotSpecified, cgmlp_conv_kernel: int = 31, merge_conv_kernel: int = 3, norm: type | Dict[str, Any] | Module | Callable = LayerNorm)
E-Branchformer layer, e.g. to be used in the returnn.frontend.encoder.conformer.ConformerEncoder. See the module docstring returnn.frontend.encoder.e_branchformer for an example.
By convention, any options to the module are passed to __init__, and potentially changing inputs (other tensors) are passed to __call__().
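A minimal sketch of direct usage, illustrating the __init__/__call__ convention above. It assumes eager-mode usage with the PyTorch backend via rf.select_backend_torch(), and it assumes the layer follows the same call convention as the Conformer encoder layer (input tensor plus its spatial dim); all dims and sizes are made up for illustration, not recommended settings:

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.encoder.e_branchformer import EBranchformerLayer

    rf.select_backend_torch()  # assumed eager-mode setup on top of PyTorch

    # Hypothetical dims, chosen only for illustration.
    batch_dim = Dim(3, name="batch")
    time_dim = Dim(50, name="time")
    model_dim = Dim(512, name="model")

    layer = EBranchformerLayer(
        out_dim=model_dim,
        ff_dim=2048,        # linear_units in ESPnet
        num_heads=8,        # attention_heads in ESPnet
        cgmlp_ff_dim=1536,  # half of cgmlp_linear_units=3072 in ESPnet
    )

    x = rf.random_normal([batch_dim, time_dim, model_dim])
    # Assumed call convention (same as the Conformer encoder layer):
    y = layer(x, spatial_dim=time_dim)  # [batch, time, model_dim]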
- class returnn.frontend.encoder.e_branchformer.FeedForwardConvGated(out_dim: Dim, *, ff_dim: Dim | int = NotSpecified, kernel_size: int = 31, dropout: float = 0.1, activation: Callable[[Tensor], Tensor] | Dict[str, Any] | Module = gelu, gate_activation: Callable[[Tensor], Tensor] | Dict[str, Any] | Module = identity, with_bias: bool = True, norm: type | Dict[str, Any] | Module | Callable = LayerNorm)
Convolutional Gating MLP (cgMLP) as introduced in https://openreview.net/forum?id=RA-zVvZLYIy and then used by the E-Branchformer model (https://arxiv.org/pdf/2210.00077). It uses the Convolutional Spatial Gating Unit (CSGU). This is the local extractor branch in the E-Branchformer model.
Related is the returnn.frontend.decoder.transformer.FeedForwardGated module.
- Parameters:
  - out_dim – the encoder (e.g. E-Branchformer) model dim (usually 256 or 512)
  - ff_dim – intermediate dimension. This is like cgmlp_linear_units/2 in ESPnet. Note the 1/2 factor: in ESPnet, you specify the total dimension before it is split for the gating, while here you specify the dimension of the gating part. Common settings are 2048/2 or 3072/2. In the paper, they mention a factor of 3 of the model dimension (factor 6 in the ESPnet setting). A sketch of this mapping follows the parameter list below.
  - kernel_size – for the depthwise convolution (usually 31)
  - dropout
  - activation – activation function after the first linear layer, for both parts. Default as in the paper: gelu. Note: in returnn.frontend.decoder.transformer.FeedForwardGated, the activation arg corresponds to gate_activation here.
  - gate_activation – activation function for the gate part, before the gating (mult) is applied. Default as in the paper: identity. Note: in returnn.frontend.decoder.transformer.FeedForwardGated, the activation arg corresponds to gate_activation here.
  - with_bias – whether to use bias in the linear layers and the conv layer. Default as in the paper: True.
  - norm – normalization layer. Default as in the paper: LayerNorm.
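To make the 1/2 factor for ff_dim concrete, here is a small sketch mapping a hypothetical ESPnet-style configuration (output_size=512, cgmlp_linear_units=3072) onto this module's arguments; the numbers are illustrative only, and the backend selection is assumed for eager-mode construction:

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.encoder.e_branchformer import FeedForwardConvGated

    rf.select_backend_torch()  # assumed eager-mode setup

    # Hypothetical ESPnet-style settings, only for illustration:
    output_size = 512          # ESPnet output_size -> out_dim here
    cgmlp_linear_units = 3072  # ESPnet total dim, split in two for the gating

    model_dim = Dim(output_size, name="model")
    cgmlp = FeedForwardConvGated(
        out_dim=model_dim,
        ff_dim=cgmlp_linear_units // 2,  # 1536, i.e. 3x the model dim as mentioned in the paper
        kernel_size=31,
    )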
- class returnn.frontend.encoder.e_branchformer.Merge(*, in_dim1: Dim, in_dim2: Dim, out_dim: Dim, merge_conv_kernel: int = 3)
The merge module from the E-Branchformer model. Following the paper, it combines the outputs of the global (attention) and local (cgMLP) branches: the two are concatenated, a depthwise convolution with kernel size merge_conv_kernel is applied, and the result is projected back to out_dim.
By convention, any options to the module are passed to __init__, and potentially changing inputs (other tensors) are passed to __call__().
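For orientation, a minimal sketch of wiring up the merge module. The call convention shown here (the two branch outputs plus their shared spatial dim) is an assumption, not taken from this documentation, and all dims are made up:

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.encoder.e_branchformer import Merge

    rf.select_backend_torch()  # assumed eager-mode setup

    batch_dim = Dim(3, name="batch")
    time_dim = Dim(50, name="time")
    model_dim = Dim(512, name="model")

    merge = Merge(in_dim1=model_dim, in_dim2=model_dim, out_dim=model_dim, merge_conv_kernel=3)

    att_out = rf.random_normal([batch_dim, time_dim, model_dim])    # global (attention) branch
    cgmlp_out = rf.random_normal([batch_dim, time_dim, model_dim])  # local (cgMLP) branch
    # Assumed call convention: the two branch outputs plus the spatial dim.
    merged = merge(att_out, cgmlp_out, spatial_dim=time_dim)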