returnn.frontend.encoder.conformer
Conformer model, variant of Transformer with additional convolution, introduced for speech recognition. Ref: https://arxiv.org/abs/2005.08100
About details of the specific implementation and other implementations, see: https://github.com/rwth-i6/returnn_common/issues/233
- class returnn.frontend.encoder.conformer.ConformerPositionwiseFeedForward(out_dim: ~returnn.tensor.dim.Dim, *, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function silu>, **kwargs)[source]¶
- Conformer position-wise feedforward neural network layer
FF -> Activation -> Dropout -> FF
- Parameters:
out_dim – output feature dimension
ff_dim – dimension of the feed-forward layers
dropout – dropout value
activation – activation function. SiLU (swish) by default, unlike the base FeedForward
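The FF -> Activation -> Dropout -> FF structure can be sketched in plain NumPy (a minimal illustrative sketch, not the RETURNN implementation; weights are random, and dropout is omitted since it is inactive at inference):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def positionwise_ff(x, w1, b1, w2, b2):
    # FF -> activation -> (dropout, omitted here) -> FF
    h = silu(x @ w1 + b1)  # [time, ff_dim]
    return h @ w2 + b2     # [time, out_dim]

rng = np.random.default_rng(0)
out_dim, ff_dim, time_len = 8, 32, 5
w1 = rng.normal(size=(out_dim, ff_dim)) * 0.1
b1 = np.zeros(ff_dim)
w2 = rng.normal(size=(ff_dim, out_dim)) * 0.1
b2 = np.zeros(out_dim)
x = rng.normal(size=(time_len, out_dim))
y = positionwise_ff(x, w1, b1, w2, b2)
assert y.shape == (time_len, out_dim)
```

Note that the layer is applied independently per time step; only the feature dim is transformed.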
- class returnn.frontend.encoder.conformer.ConformerConvBlock(out_dim: Dim, *, kernel_size: int, norm: BatchNorm | Any)[source]¶
- Conformer convolution block
FF -> GLU -> depthwise conv -> BN -> Swish -> FF
- Parameters:
out_dim – output feature dimension
kernel_size – kernel size of depthwise convolution
norm – Batch norm originally
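The GLU step in the block above can be illustrated in NumPy (the preceding pointwise FF typically doubles the feature dim so that the GLU halves it again; `glu` here is an illustrative helper, not a RETURNN function):

```python
import numpy as np

def glu(x, axis=-1):
    # Gated linear unit: split the feature dim in half and gate one half
    # with the sigmoid of the other: GLU([a; b]) = a * sigmoid(b)
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[1.0, 2.0, 0.0, 0.0]])  # features [a1, a2, b1, b2]
y = glu(x)
# sigmoid(0) == 0.5, so the "a" half is scaled by 0.5 here
assert np.allclose(y, [[0.5, 1.0]])
```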
- class returnn.frontend.encoder.conformer.ConformerConvSubsample(in_dim: ~returnn.tensor.dim.Dim, *, out_dims: ~typing.List[~returnn.tensor.dim.Dim | int], filter_sizes: ~typing.List[int | ~typing.Tuple[int, int]], strides: ~typing.List[int | ~typing.Tuple[int, int]] | None = None, pool_sizes: ~typing.List[~typing.Tuple[int, int]] | None = None, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] = <function relu>, padding: str = 'same')[source]¶
Conv 2D block with optional max-pooling or striding.
References:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/rwth-i6/returnn-experiments/blob/5852e21f44d5450909dee29d89020f1b8d36aa68/2022-swb-conformer-hybrid-sat/table_1_and_2/reduced_dim.config#L226 (the latter is actually different…)
To get the ESPnet case, for example Conv2dSubsampling6, use these options (out_dim is the model dim of the encoder):

out_dims=[out_dim, out_dim],  # ESPnet standard, but this might be too large
filter_sizes=[3, 5],
strides=[2, 3],
padding="valid",
- Parameters:
out_dims – the number of output channels per conv layer. The last element is the output feature dimension
filter_sizes – a list of filter sizes for the conv layer
pool_sizes – a list of pooling factors applied after conv layer
activation – the activation function
padding – 'same' or 'valid'
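The time reduction produced by a given choice of filter_sizes, strides, and padding can be checked with standard conv output-length arithmetic (`conv_out_len` is an illustrative helper, not part of the RETURNN API):

```python
def conv_out_len(t, filter_size, stride, padding):
    # Output length along time for one conv layer,
    # following standard conv semantics.
    if padding == "same":
        return -(-t // stride)  # ceil(t / stride)
    elif padding == "valid":
        return (t - filter_size) // stride + 1
    raise ValueError(f"unknown padding {padding!r}")

# ESPnet Conv2dSubsampling6-style: filter_sizes=[3, 5], strides=[2, 3], "valid"
t = 100
for f, s in zip([3, 5], [2, 3]):
    t = conv_out_len(t, f, s, "valid")
# t is now 15, i.e. roughly a factor-6 reduction of the 100 input frames
```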
- class returnn.frontend.encoder.conformer.ConformerEncoderLayer(out_dim: ~returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, ff: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, ff_dim: ~returnn.tensor.dim.Dim = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, conv_kernel_size: int = 32, conv_norm: ~returnn.frontend.normalization.BatchNorm | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any = <class 'returnn.util.basic.NotSpecified'>, conv_norm_opts: ~typing.Dict[str, ~typing.Any] | None = None, num_heads: int = 4, self_att: ~returnn.frontend.attention.RelPosSelfAttention | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any | None = None, self_att_opts: ~typing.Dict[str, ~typing.Any] | None = None, att_dropout: float = 0.1, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>)[source]¶
Represents a Conformer block
- Parameters:
out_dim – the output feature dimension
ff_dim – the dimension of feed-forward layers. 2048 originally, or 4 times out_dim
ff_activation – activation function for feed-forward network
dropout – the dropout value for the FF block
conv_kernel_size – the kernel size of depthwise convolution in the conv block
conv_norm – used for the conv block. Batch norm originally
conv_norm_opts – options for nn.BatchNorm or whichever conv_norm type is used. For nn.BatchNorm, use_mask=False is used by default. use_mask determines whether to properly mask the spatial dim in batch norm; most existing implementations do not do this (RETURNN being an exception), and it is faster without the masking.
num_heads – the number of attention heads
self_att – the self-attention layer. RelPosSelfAttention originally and by default
self_att_opts – options for the self-attention layer, e.g. for nn.RelPosSelfAttention
att_dropout – attention dropout value
norm – pre-normalization for FF, conv and attention blocks
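The block structure from the Conformer paper (half-step "macaron" feed-forward modules around self-attention and convolution, each with a residual connection, then a final layer norm) can be sketched abstractly (illustrative composition only; the pre-norm applied inside each sub-block is folded into the callables here):

```python
def conformer_layer(x, ffn1, self_att, conv, ffn2, final_norm):
    # Macaron-style Conformer block: each sub-block is residual,
    # the two feed-forward modules contribute with weight 0.5.
    x = x + 0.5 * ffn1(x)
    x = x + self_att(x)
    x = x + conv(x)
    x = x + 0.5 * ffn2(x)
    return final_norm(x)

# With zero sub-blocks and an identity norm, the input passes through unchanged:
zero = lambda x: 0.0 * x
ident = lambda x: x
assert conformer_layer(2.0, zero, zero, zero, zero, ident) == 2.0
```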
- class returnn.frontend.encoder.conformer.ConformerEncoder(in_dim: ~returnn.tensor.dim.Dim, out_dim: ~returnn.tensor.dim.Dim | int = Dim{'conformer-enc-default-out-dim'(512)}, *, num_layers: int, input_layer: ~returnn.frontend.encoder.conformer.ConformerConvSubsample | ~returnn.frontend.encoder.base.ISeqDownsamplingEncoder | ~returnn.frontend.module.Module | ~typing.Any | None, input_embedding_scale: float = 1.0, input_dropout: float = 0.1, ff_dim: ~returnn.tensor.dim.Dim = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, conv_kernel_size: int = <class 'returnn.util.basic.NotSpecified'>, conv_norm: ~returnn.frontend.normalization.BatchNorm | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any = <class 'returnn.util.basic.NotSpecified'>, num_heads: int = 4, att_dropout: float = 0.1, encoder_layer: ~returnn.frontend.encoder.conformer.ConformerEncoderLayer | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any | None = None, encoder_layer_opts: ~typing.Dict[str, ~typing.Any] | None = None, sequential=<class 'returnn.frontend.container.Sequential'>)[source]¶
Represents the Conformer encoder architecture
- Parameters:
in_dim – input features (e.g. MFCC)
out_dim – the output feature dimension
num_layers – the number of encoder layers
input_layer – input/frontend/prenet with potential subsampling. (x, in_spatial_dim) -> (y, out_spatial_dim)
input_embedding_scale – applied after input_layer. 1.0 by default for historic reasons. In the standard Transformer, and also in ESPnet's E-Branchformer and Conformer, this is sqrt(out_dim).
input_dropout – applied after input_projection(input_layer(x))
ff_dim – the dimension of feed-forward layers. 2048 originally, or 4 times out_dim
ff_activation – activation function for feed-forward network
dropout – the dropout value for the FF block
conv_kernel_size – the kernel size of depthwise convolution in the conv block
conv_norm – used for the conv block. Batch norm originally
num_heads – the number of attention heads
att_dropout – attention dropout value
encoder_layer – an instance of ConformerEncoderLayer or similar
encoder_layer_opts – options for the encoder layer
sequential – container used to stack the encoder layers; Sequential by default
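The overall data flow (frontend/subsampling, projection to the model dim, then the sequential layer stack) can be sketched abstractly (illustrative only; input_embedding_scale and input_dropout are omitted):

```python
def conformer_encoder(x, input_layer, input_proj, layers):
    # Frontend/subsampling, linear projection to the model dim,
    # then the encoder layers applied sequentially.
    x = input_proj(input_layer(x))
    for layer in layers:
        x = layer(x)
    return x

# With doubling functions standing in for each stage,
# the four stages compose as expected:
double = lambda x: 2 * x
y = conformer_encoder(1, double, double, [double, double])
assert y == 16
```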
- returnn.frontend.encoder.conformer.make_ff(*, out_dim: Dim, ff: type | Dict[str, Any] | Module, ff_dim: Dim | int, ff_activation: Callable[[Tensor], Tensor] | Dict[str, Any] | Module, dropout: float) ConformerPositionwiseFeedForward | Module [source]¶
Make the feed-forward part of the Conformer layer