returnn.frontend.encoder.conformer

Conformer model, a variant of the Transformer with an additional convolution module, introduced for speech recognition. Ref: https://arxiv.org/abs/2005.08100

For details on this specific implementation and a comparison to other implementations, see: https://github.com/rwth-i6/returnn_common/issues/233
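
For orientation, a single Conformer block, as defined in the paper, combines two half-step feed-forward modules with multi-head self-attention and convolution:

    \tilde{x} = x + \tfrac{1}{2}\,\mathrm{FFN}(x)
    x' = \tilde{x} + \mathrm{MHSA}(\tilde{x})
    x'' = x' + \mathrm{Conv}(x')
    y = \mathrm{LayerNorm}\bigl(x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'')\bigr)

ConformerEncoderLayer below implements this block, and ConformerEncoder stacks num_layers of them on top of an input frontend (see ConformerConvSubsample).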

class returnn.frontend.encoder.conformer.ConformerPositionwiseFeedForward(out_dim: Dim, *, ff_dim: Dim, dropout: float, activation: Callable[[Tensor], Tensor])[source]
Conformer position-wise feedforward neural network layer

FF -> Activation -> Dropout -> FF

Parameters:
  • out_dim – output feature dimension

  • ff_dim – dimension of the feed-forward layers

  • dropout – dropout value

  • activation – activation function
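
A minimal usage sketch; the Dim values and the choice of rf.silu are illustrative, not prescribed by this module:

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerPositionwiseFeedForward

    model_dim = Dim(512, name="model")
    ff = ConformerPositionwiseFeedForward(
        out_dim=model_dim,
        ff_dim=Dim(2048, name="ff"),  # 4 * model_dim, as in the original paper
        dropout=0.1,
        activation=rf.silu,  # i.e. Swish
    )
    # y = ff(x)  # x: Tensor with feature dim model_dim; y keeps feature dim model_dim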

class returnn.frontend.encoder.conformer.ConformerConvBlock(out_dim: Dim, *, kernel_size: int, norm: BatchNorm | Any)[source]
Conformer convolution block

FF -> GLU -> depthwise conv -> BN -> Swish -> FF

Parameters:
  • out_dim – output feature dimension

  • kernel_size – kernel size of depthwise convolution

  • norm – the normalization layer applied after the depthwise convolution; batch norm originally
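
A minimal construction sketch, assuming rf.BatchNorm takes the feature dim as its first argument (use_mask is explained under ConformerEncoderLayer below):

    import returnn.frontend as rf
    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvBlock

    model_dim = Dim(512, name="model")
    conv_block = ConformerConvBlock(
        out_dim=model_dim,
        kernel_size=32,  # paper default
        norm=rf.BatchNorm(model_dim, use_mask=False),  # batch norm, as in the paper
    )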

class returnn.frontend.encoder.conformer.ConformerConvSubsample(in_dim: Dim, *, out_dims: List[Dim], filter_sizes: List[int | Tuple[int, int]], strides: List[int | Tuple[int, int]] | None = None, pool_sizes: List[Tuple[int, int]] | None = None, activation: Callable[[Tensor], Tensor] = relu, padding: str = 'same')[source]

Conv 2D block with optional max-pooling or striding.

To get the ESPnet case, for example Conv2dSubsampling6, use these options (out_dim is the model dim of the encoder):

    out_dims=[out_dim, out_dim],  # ESPnet standard, but this might be too large
    filter_sizes=[3, 5],
    strides=[2, 3],
    padding="valid",

Parameters:
  • in_dim – input feature dimension

  • out_dims – the output dimensions (number of channels) of each conv layer; the last element is the output feature dimension

  • filter_sizes – a list of filter sizes, one per conv layer

  • strides – an optional list of strides, one per conv layer

  • pool_sizes – a list of pooling factors, applied after each conv layer

  • activation – the activation function

  • padding – 'same' or 'valid'
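
Putting the ESPnet-style options above together; the in_dim of 80 log-mel features is illustrative, and the call convention in the comment follows the input_layer contract documented below:

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample

    in_dim = Dim(80, name="mel")  # e.g. log-mel feature dim; illustrative
    model_dim = Dim(512, name="model")
    frontend = ConformerConvSubsample(
        in_dim,
        out_dims=[model_dim, model_dim],
        filter_sizes=[3, 5],
        strides=[2, 3],  # total time downsampling factor 6
        padding="valid",
    )
    # given x with time Dim time_dim:
    # y, out_spatial_dim = frontend(x, in_spatial_dim=time_dim)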

class returnn.frontend.encoder.conformer.ConformerEncoderLayer(out_dim: Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, ff_dim: Dim = NotSpecified, ff_activation: Callable[[Tensor], Tensor] = silu, dropout: float = 0.1, conv_kernel_size: int = 32, conv_norm: BatchNorm | type | Any = NotSpecified, conv_norm_opts: Dict[str, Any] | None = None, num_heads: int = 4, self_att: RelPosSelfAttention | Module | type | Any | None = None, self_att_opts: Dict[str, Any] | None = None, att_dropout: float = 0.1)[source]

Represents a Conformer block (a single encoder layer)

Parameters:
  • out_dim – the output feature dimension

  • ff_dim – the dimension of the feed-forward layers. 2048 in the original paper, i.e. 4 times out_dim

  • ff_activation – activation function for feed-forward network

  • dropout – the dropout value for the FF block

  • conv_kernel_size – the kernel size of depthwise convolution in the conv block

  • conv_norm – used for the conv block. Batch norm originally

  • conv_norm_opts –

    options for nn.BatchNorm, or for whatever other conv_norm type is used. In case of nn.BatchNorm, use_mask=False is used by default here.

    use_mask specifies whether to properly mask the spatial dim in batch norm. Most existing implementations do not do this (RETURNN is an exception), and it is faster without the masking.

  • num_heads – the number of attention heads

  • self_att – the self-attention layer. RelPosSelfAttention originally, which is also the default here

  • self_att_opts – options for the self-attention layer, for nn.RelPosSelfAttention

  • att_dropout – attention dropout value
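
A construction sketch with the defaults spelled out (values mirror the parameter defaults above; the call convention in the comment is an assumption, not part of the documented signature):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerEncoderLayer

    model_dim = Dim(512, name="model")
    layer = ConformerEncoderLayer(
        out_dim=model_dim,
        ff_dim=Dim(2048, name="ff"),
        dropout=0.1,
        conv_kernel_size=32,
        num_heads=4,
        att_dropout=0.1,
    )
    # y = layer(x, spatial_dim=time_dim)  # assumed call convention; x has feature dim model_dim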

class returnn.frontend.encoder.conformer.ConformerEncoder(in_dim: Dim, out_dim: Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, num_layers: int, input_layer: ConformerConvSubsample | ISeqDownsamplingEncoder | Module | Any, input_dropout: float = 0.1, ff_dim: Dim = NotSpecified, ff_activation: Callable[[Tensor], Tensor] = silu, dropout: float = 0.1, conv_kernel_size: int = 32, conv_norm: BatchNorm | type | Any = NotSpecified, num_heads: int = 4, att_dropout: float = 0.1, encoder_layer: ConformerEncoderLayer | Module | type | Any | None = None, encoder_layer_opts: Dict[str, Any] | None = None, sequential=Sequential)[source]

Represents the Conformer encoder architecture

Parameters:
  • in_dim – input feature dimension

  • out_dim – the output feature dimension

  • num_layers – the number of encoder layers

  • input_layer – input/frontend/prenet with potential subsampling. (x, in_spatial_dim) -> (y, out_spatial_dim)

  • input_dropout – applied after input_projection(input_layer(x))

  • ff_dim – the dimension of the feed-forward layers. 2048 in the original paper, i.e. 4 times out_dim

  • ff_activation – activation function for feed-forward network

  • dropout – the dropout value for the FF block

  • conv_kernel_size – the kernel size of depthwise convolution in the conv block

  • conv_norm – used for the conv block. Batch norm originally

  • num_heads – the number of attention heads

  • att_dropout – attention dropout value

  • encoder_layer – an instance of ConformerEncoderLayer or similar

  • encoder_layer_opts – options for the encoder layer

  • sequential – the container module used to stack the encoder layers; Sequential by default
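
A full construction sketch combining the pieces above. The dims, layer count, and head count are illustrative, and the call convention in the final comment follows the input_layer contract (x, in_spatial_dim) -> (y, out_spatial_dim):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample, ConformerEncoder

    in_dim = Dim(80, name="mel")
    model_dim = Dim(512, name="model")
    encoder = ConformerEncoder(
        in_dim,
        model_dim,
        num_layers=12,
        input_layer=ConformerConvSubsample(
            in_dim,
            out_dims=[model_dim, model_dim],
            filter_sizes=[3, 5],
            strides=[2, 3],
            padding="valid",
        ),
        ff_dim=Dim(2048, name="ff"),
        num_heads=8,
        encoder_layer_opts=dict(conv_norm_opts=dict(use_mask=True)),  # e.g. properly mask the spatial dim in batch norm
    )
    # y, out_spatial_dim = encoder(x, in_spatial_dim=time_dim)  # assumed call convention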