returnn.frontend.encoder.conformer
Conformer model, variant of Transformer with additional convolution, introduced for speech recognition. Ref: https://arxiv.org/abs/2005.08100
About details of the specific implementation and other implementations, see: https://github.com/rwth-i6/returnn_common/issues/233
- class returnn.frontend.encoder.conformer.ConformerPositionwiseFeedForward(out_dim: ~returnn.tensor.dim.Dim, *, ff_dim: ~returnn.tensor.dim.Dim | int = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <function silu>, **kwargs)[source]¶
- Conformer position-wise feedforward neural network layer
FF -> Activation -> Dropout -> FF
- Parameters:
out_dim – output feature dimension
ff_dim – dimension of the feed-forward layers
dropout – dropout value
activation – activation function. SiLU (swish) by default, unlike the base FeedForward
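The FF -> Activation -> Dropout -> FF structure can be sketched in plain NumPy (a minimal illustrative sketch, not the RETURNN implementation; weights are random, and dropout is omitted since it is inactive at inference):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def positionwise_ff(x, w1, b1, w2, b2):
    # FF -> activation -> (dropout, omitted here) -> FF
    h = silu(x @ w1 + b1)  # [time, ff_dim]
    return h @ w2 + b2     # [time, out_dim]

rng = np.random.default_rng(0)
out_dim, ff_dim, time_len = 8, 32, 5
w1 = rng.normal(size=(out_dim, ff_dim)) * 0.1
b1 = np.zeros(ff_dim)
w2 = rng.normal(size=(ff_dim, out_dim)) * 0.1
b2 = np.zeros(out_dim)
x = rng.normal(size=(time_len, out_dim))
y = positionwise_ff(x, w1, b1, w2, b2)
assert y.shape == (time_len, out_dim)
```

Note that the layer is applied independently per time step; only the feature dim is transformed.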
- class returnn.frontend.encoder.conformer.ConformerConvBlock(out_dim: Dim, *, kernel_size: int, norm: BatchNorm | Any)[source]¶
- Conformer convolution block
FF -> GLU -> depthwise conv -> BN -> Swish -> FF
- Parameters:
out_dim – output feature dimension
kernel_size – kernel size of depthwise convolution
norm – Batch norm originally
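The GLU step in the block above can be illustrated in NumPy (the preceding pointwise FF typically doubles the feature dim so that the GLU halves it again; `glu` here is an illustrative helper, not a RETURNN function):

```python
import numpy as np

def glu(x, axis=-1):
    # Gated linear unit: split the feature dim in half and gate one half
    # with the sigmoid of the other: GLU([a; b]) = a * sigmoid(b)
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[1.0, 2.0, 0.0, 0.0]])  # features [a1, a2, b1, b2]
y = glu(x)
# sigmoid(0) == 0.5, so the "a" half is scaled by 0.5 here
assert np.allclose(y, [[0.5, 1.0]])
```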
- class returnn.frontend.encoder.conformer.ConformerConvSubsample(in_dim: ~returnn.tensor.dim.Dim, *, out_dims: ~typing.List[~returnn.tensor.dim.Dim | int], filter_sizes: ~typing.List[int | ~typing.Tuple[int, int]], strides: ~typing.List[int | ~typing.Tuple[int, int]] | None = None, pool_sizes: ~typing.List[~typing.Tuple[int, int]] | None = None, activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] = <function relu>, padding: str = 'same')[source]¶
Conv 2D block with optional max-pooling or striding.
References:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/rwth-i6/returnn-experiments/blob/5852e21f44d5450909dee29d89020f1b8d36aa68/2022-swb-conformer-hybrid-sat/table_1_and_2/reduced_dim.config#L226 (the latter is actually different…)
To get the ESPnet case, for example Conv2dSubsampling6, use these options (out_dim is the model dim of the encoder):

out_dims=[out_dim, out_dim],  # ESPnet standard, but this might be too large
filter_sizes=[3, 5],
strides=[2, 3],
padding="valid",
- Parameters:
out_dims – the number of output channels per conv layer. The last element is the output feature dimension
filter_sizes – a list of filter sizes for the conv layer
pool_sizes – a list of pooling factors applied after conv layer
activation – the activation function
padding – 'same' or 'valid'
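The time reduction produced by a given choice of filter_sizes, strides, and padding can be checked with standard conv output-length arithmetic (`conv_out_len` is an illustrative helper, not part of the RETURNN API):

```python
def conv_out_len(t, filter_size, stride, padding):
    # Output length along time for one conv layer,
    # following standard conv semantics.
    if padding == "same":
        return -(-t // stride)  # ceil(t / stride)
    elif padding == "valid":
        return (t - filter_size) // stride + 1
    raise ValueError(f"unknown padding {padding!r}")

# ESPnet Conv2dSubsampling6-style: filter_sizes=[3, 5], strides=[2, 3], "valid"
t = 100
for f, s in zip([3, 5], [2, 3]):
    t = conv_out_len(t, f, s, "valid")
# t is now 15, i.e. roughly a factor-6 reduction of the 100 input frames
```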
- class returnn.frontend.encoder.conformer.ConformerEncoderLayer(out_dim: ~returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, ff: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, ff_dim: ~returnn.tensor.dim.Dim = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, conv_kernel_size: int = 32, conv_norm: ~returnn.frontend.normalization.BatchNorm | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any = <class 'returnn.util.basic.NotSpecified'>, conv_norm_opts: ~typing.Dict[str, ~typing.Any] | None = None, num_heads: int = 4, self_att: ~returnn.frontend.attention.RelPosSelfAttention | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any | None = None, self_att_opts: ~typing.Dict[str, ~typing.Any] | None = None, att_dropout: float = 0.1, norm: type | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module | ~typing.Callable = <class 'returnn.frontend.normalization.LayerNorm'>)[source]¶
Represents a Conformer block
- Parameters:
out_dim – the output feature dimension
ff_dim – the dimension of feed-forward layers. 2048 originally, or 4 times out_dim
ff_activation – activation function for feed-forward network
dropout – the dropout value for the FF block
conv_kernel_size – the kernel size of depthwise convolution in the conv block
conv_norm – used for the conv block. Batch norm originally
conv_norm_opts – options for nn.BatchNorm or whichever conv_norm type is used. For nn.BatchNorm, use_mask=False is used by default. use_mask determines whether to properly mask the spatial dim in batch norm; most existing implementations do not do this (RETURNN being an exception), and it is faster without the masking.
num_heads – the number of attention heads
self_att – the self-attention layer. RelPosSelfAttention originally and by default
self_att_opts – options for the self-attention layer, e.g. for nn.RelPosSelfAttention
att_dropout – attention dropout value
norm – pre-normalization for FF, conv and attention blocks
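The block structure from the Conformer paper (half-step "macaron" feed-forward modules around self-attention and convolution, each with a residual connection, then a final layer norm) can be sketched abstractly (illustrative composition only; the pre-norm applied inside each sub-block is folded into the callables here):

```python
def conformer_layer(x, ffn1, self_att, conv, ffn2, final_norm):
    # Macaron-style Conformer block: each sub-block is residual,
    # the two feed-forward modules contribute with weight 0.5.
    x = x + 0.5 * ffn1(x)
    x = x + self_att(x)
    x = x + conv(x)
    x = x + 0.5 * ffn2(x)
    return final_norm(x)

# With zero sub-blocks and an identity norm, the input passes through unchanged:
zero = lambda x: 0.0 * x
ident = lambda x: x
assert conformer_layer(2.0, zero, zero, zero, zero, ident) == 2.0
```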
- class returnn.frontend.encoder.conformer.ConformerEncoder(in_dim: ~returnn.tensor.dim.Dim, out_dim: ~returnn.tensor.dim.Dim | int = Dim{'conformer-enc-default-out-dim'(512)}, *, num_layers: int, input_layer: ~returnn.frontend.encoder.conformer.ConformerConvSubsample | ~returnn.frontend.encoder.base.ISeqDownsamplingEncoder | ~returnn.frontend.module.Module | ~typing.Any | None, input_embedding_scale: float = 1.0, input_dropout: float = 0.1, ff_dim: ~returnn.tensor.dim.Dim = <class 'returnn.util.basic.NotSpecified'>, ff_activation: ~typing.Callable[[~returnn.tensor.tensor.Tensor], ~returnn.tensor.tensor.Tensor] | ~typing.Dict[str, ~typing.Any] | ~returnn.frontend.module.Module = <class 'returnn.util.basic.NotSpecified'>, dropout: float = 0.1, conv_kernel_size: int = <class 'returnn.util.basic.NotSpecified'>, conv_norm: ~returnn.frontend.normalization.BatchNorm | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any = <class 'returnn.util.basic.NotSpecified'>, num_heads: int = 4, att_dropout: float = 0.1, encoder_layer: ~returnn.frontend.encoder.conformer.ConformerEncoderLayer | ~returnn.frontend.module.Module | type | ~typing.Dict[str, ~typing.Any] | ~typing.Any | None = None, encoder_layer_opts: ~typing.Dict[str, ~typing.Any] | None = None, sequential=<class 'returnn.frontend.container.Sequential'>)[source]¶
Represents the Conformer encoder architecture
- Parameters:
in_dim – input features (e.g. MFCC)
out_dim – the output feature dimension
num_layers – the number of encoder layers
input_layer – input/frontend/prenet with potential subsampling. (x, in_spatial_dim) -> (y, out_spatial_dim)
input_embedding_scale – applied after input_layer. 1.0 by default for historic reasons. In the standard Transformer, and also in ESPnet's E-Branchformer and Conformer, this is sqrt(out_dim).
input_dropout – applied after input_projection(input_layer(x))
ff_dim – the dimension of feed-forward layers. 2048 originally, or 4 times out_dim
ff_activation – activation function for feed-forward network
dropout – the dropout value for the FF block
conv_kernel_size – the kernel size of depthwise convolution in the conv block
conv_norm – used for the conv block. Batch norm originally
num_heads – the number of attention heads
att_dropout – attention dropout value
encoder_layer – an instance of ConformerEncoderLayer or similar
encoder_layer_opts – options for the encoder layer
sequential – container used to stack the encoder layers; Sequential by default
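The overall data flow (frontend/subsampling, projection to the model dim, then the sequential layer stack) can be sketched abstractly (illustrative only; input_embedding_scale and input_dropout are omitted):

```python
def conformer_encoder(x, input_layer, input_proj, layers):
    # Frontend/subsampling, linear projection to the model dim,
    # then the encoder layers applied sequentially.
    x = input_proj(input_layer(x))
    for layer in layers:
        x = layer(x)
    return x

# With doubling functions standing in for each stage,
# the four stages compose as expected:
double = lambda x: 2 * x
y = conformer_encoder(1, double, double, [double, double])
assert y == 16
```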
- returnn.frontend.encoder.conformer.make_ff(*, out_dim: Dim, ff: type | Dict[str, Any] | Module, ff_dim: Dim | int, ff_activation: Callable[[Tensor], Tensor] | Dict[str, Any] | Module, dropout: float) ConformerPositionwiseFeedForward | Module [source]¶
Make the feed-forward part of the Conformer layer