returnn.frontend.encoder.conformer_v2

Conformer model, a variant of the Transformer with an additional convolution module, introduced for speech recognition. Ref: https://arxiv.org/abs/2005.08100

For details on this specific implementation and on other implementations, see: https://github.com/rwth-i6/returnn_common/issues/233

V2: Split frontend and main encoder.

class returnn.frontend.encoder.conformer_v2.ConformerFrontend(in_dim: returnn.tensor.dim.Dim, out_dim: returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, input_layer: returnn.frontend.encoder.conformer.ConformerConvSubsample | returnn.frontend.encoder.base.ISeqDownsamplingEncoder | returnn.frontend.module.Module | typing.Any | None, input_embedding_scale: float = 1.0, input_dropout: float = 0.1)

This is just the combination of:
  • input_layer (ConformerConvSubsample)
  • input_projection (Linear, without bias)
  • input_embedding_scale
  • input_dropout

This is intended to be used together with ConformerEncoderV2.

Parameters:
  • in_dim – input features (e.g. MFCC)

  • out_dim – the output feature dimension

  • input_layer – input/frontend/prenet with potential subsampling. (x, in_spatial_dim) -> (y, out_spatial_dim)

  • input_embedding_scale – applied after input_layer. 1.0 by default, for historical reasons. In the standard Transformer, as well as in the ESPnet E-Branchformer and Conformer, this is sqrt(out_dim).

  • input_dropout – applied after input_projection(input_layer(x))
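A minimal construction sketch for the frontend (the dims and the ConformerConvSubsample settings here are illustrative choices, not prescribed defaults):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample
    from returnn.frontend.encoder.conformer_v2 import ConformerFrontend

    in_dim = Dim(80, name="logmel")  # e.g. log-mel input features
    enc_dim = Dim(512, name="enc")   # Conformer model dimension

    frontend = ConformerFrontend(
        in_dim,
        enc_dim,
        input_layer=ConformerConvSubsample(
            in_dim,
            out_dims=[Dim(32, name="c1"), Dim(64, name="c2"), Dim(64, name="c3")],
            filter_sizes=[(3, 3), (3, 3), (3, 3)],
            pool_sizes=[(1, 2)],
            strides=[(1, 1), (3, 1), (2, 1)],  # 6x time downsampling overall
        ),
    )
    # (x, in_spatial_dim) -> (y, out_spatial_dim):
    # y, out_spatial_dim = frontend(x, in_spatial_dim=time_dim)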

class returnn.frontend.encoder.conformer_v2.ConformerEncoderV2(out_dim: returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, num_layers: int, ff_dim: returnn.tensor.dim.Dim = NotSpecified, ff_activation: typing.Callable[[returnn.tensor.tensor.Tensor], returnn.tensor.tensor.Tensor] | typing.Dict[str, typing.Any] | returnn.frontend.module.Module = NotSpecified, dropout: float = 0.1, conv_kernel_size: int = NotSpecified, conv_norm: returnn.frontend.normalization.BatchNorm | type | typing.Dict[str, typing.Any] | typing.Any = NotSpecified, num_heads: int = 4, att_dropout: float = 0.1, encoder_layer: returnn.frontend.encoder.conformer.ConformerEncoderLayer | returnn.frontend.module.Module | type | typing.Dict[str, typing.Any] | typing.Any | None = None, encoder_layer_opts: typing.Dict[str, typing.Any] | None = None, sequential=returnn.frontend.container.Sequential)

Conformer encoder, without frontend.

V2: Without the input_layer / frontend module, i.e. just the Conformer layers. Use ConformerFrontend for the frontend. To get the V1 case, add a ConformerFrontend in front, as sketched below.
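A sketch of the combined, V1-equivalent pipeline (a minimal sketch: the encoder call with spatial_dim=..., returning the encoded sequence in that same spatial dim, is assumed from the V1 API, since all subsampling now happens in the frontend; source and in_spatial_dim stand for the input tensor and its time dim):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample
    from returnn.frontend.encoder.conformer_v2 import ConformerFrontend, ConformerEncoderV2

    in_dim = Dim(80, name="logmel")
    enc_dim = Dim(512, name="enc")
    frontend = ConformerFrontend(
        in_dim,
        enc_dim,
        input_layer=ConformerConvSubsample(  # as in the ConformerFrontend example above
            in_dim,
            out_dims=[Dim(32, name="c1"), Dim(64, name="c2"), Dim(64, name="c3")],
            filter_sizes=[(3, 3), (3, 3), (3, 3)],
            pool_sizes=[(1, 2)],
            strides=[(1, 1), (3, 1), (2, 1)],
        ),
    )
    encoder = ConformerEncoderV2(enc_dim, num_layers=12)

    # V1-equivalent forward pass: the frontend subsamples and projects,
    # the encoder stack then operates on the downsampled spatial dim.
    x, enc_spatial_dim = frontend(source, in_spatial_dim=in_spatial_dim)
    y = encoder(x, spatial_dim=enc_spatial_dim)  # assumed call signature, check __call__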

Parameters:
  • out_dim – the output feature dimension

  • num_layers – the number of encoder layers

  • ff_dim – the dimension of the feed-forward layers. 2048 originally, or 4 times out_dim

  • ff_activation – activation function for feed-forward network

  • dropout – the dropout value for the FF block

  • conv_kernel_size – the kernel size of depthwise convolution in the conv block

  • conv_norm – the normalization used in the conv block. Batch norm in the original paper

  • num_heads – the number of attention heads

  • att_dropout – attention dropout value

  • encoder_layer – an instance of ConformerEncoderLayer or similar

  • encoder_layer_opts – options for the encoder layer

  • sequential – the container type used to chain the encoder layers (default: Sequential)
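For illustration, a construction sketch using some of these options (a sketch only; all values are example choices taken from the signature above, not recommended settings):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer_v2 import ConformerEncoderV2

    enc_dim = Dim(512, name="enc")
    encoder = ConformerEncoderV2(
        enc_dim,
        num_layers=12,
        ff_dim=Dim(4 * enc_dim.dimension, name="ff"),  # 4 times out_dim, cf. ff_dim above
        num_heads=8,
        conv_kernel_size=32,
        dropout=0.1,
        att_dropout=0.1,
    )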