returnn.frontend.encoder.conformer_v2

Conformer model, a variant of the Transformer with an additional convolution module, introduced for speech recognition. Ref: https://arxiv.org/abs/2005.08100

For details on this specific implementation and on other implementations, see: https://github.com/rwth-i6/returnn_common/issues/233

V2: Split frontend and main encoder.

class returnn.frontend.encoder.conformer_v2.ConformerFrontend(in_dim: returnn.tensor.dim.Dim, out_dim: returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, input_layer: returnn.frontend.encoder.conformer.ConformerConvSubsample | returnn.frontend.encoder.base.ISeqDownsamplingEncoder | returnn.frontend.module.Module | typing.Any | None, input_embedding_scale: float = 1.0, input_dropout: float = 0.1)

This is just the combination of:
  • input_layer (ConformerConvSubsample)
  • input_projection (Linear, without bias)
  • input_embedding_scale
  • input_dropout

This is intended to be used together with ConformerEncoderV2.

Parameters:
  • in_dim – input features (e.g. MFCC)

  • out_dim – the output feature dimension

  • input_layer – input/frontend/prenet with potential subsampling. (x, in_spatial_dim) -> (y, out_spatial_dim)

  • input_embedding_scale – applied after input_layer. 1.0 by default, for historical reasons. In the standard Transformer, as well as in the ESPnet E-Branchformer and Conformer, this is sqrt(out_dim).

  • input_dropout – applied after input_projection(input_layer(x))
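A minimal construction sketch for the frontend (the dims and the ConformerConvSubsample settings here are illustrative choices, not prescribed defaults):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample
    from returnn.frontend.encoder.conformer_v2 import ConformerFrontend

    in_dim = Dim(80, name="logmel")  # e.g. log-mel input features
    enc_dim = Dim(512, name="enc")   # Conformer model dimension

    frontend = ConformerFrontend(
        in_dim,
        enc_dim,
        input_layer=ConformerConvSubsample(
            in_dim,
            out_dims=[Dim(32, name="c1"), Dim(64, name="c2"), Dim(64, name="c3")],
            filter_sizes=[(3, 3), (3, 3), (3, 3)],
            pool_sizes=[(1, 2)],
            strides=[(1, 1), (3, 1), (2, 1)],  # 6x time downsampling overall
        ),
    )
    # (x, in_spatial_dim) -> (y, out_spatial_dim):
    # y, out_spatial_dim = frontend(x, in_spatial_dim=time_dim)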

class returnn.frontend.encoder.conformer_v2.ConformerEncoderV2(out_dim: returnn.tensor.dim.Dim = Dim{'conformer-enc-default-out-dim'(512)}, *, num_layers: int, ff_dim: returnn.tensor.dim.Dim = NotSpecified, ff_activation: typing.Callable[[returnn.tensor.tensor.Tensor], returnn.tensor.tensor.Tensor] | typing.Dict[str, typing.Any] | returnn.frontend.module.Module = NotSpecified, dropout: float = 0.1, conv_kernel_size: int = NotSpecified, conv_norm: returnn.frontend.normalization.BatchNorm | type | typing.Dict[str, typing.Any] | typing.Any = NotSpecified, num_heads: int = 4, att_dropout: float = 0.1, encoder_layer: returnn.frontend.encoder.conformer.ConformerEncoderLayer | returnn.frontend.module.Module | type | typing.Dict[str, typing.Any] | typing.Any | None = None, encoder_layer_opts: typing.Dict[str, typing.Any] | None = None, sequential=returnn.frontend.container.Sequential)

Conformer encoder, without frontend.

V2: Without the input_layer / frontend module, i.e. just the Conformer layers. Use ConformerFrontend for the frontend. To get the V1 case, add a ConformerFrontend in front, as sketched below.
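A sketch of the combined, V1-equivalent pipeline (a minimal sketch: the encoder call with spatial_dim=..., returning the encoded sequence in that same spatial dim, is assumed from the V1 API, since all subsampling now happens in the frontend; source and in_spatial_dim stand for the input tensor and its time dim):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer import ConformerConvSubsample
    from returnn.frontend.encoder.conformer_v2 import ConformerFrontend, ConformerEncoderV2

    in_dim = Dim(80, name="logmel")
    enc_dim = Dim(512, name="enc")
    frontend = ConformerFrontend(
        in_dim,
        enc_dim,
        input_layer=ConformerConvSubsample(  # as in the ConformerFrontend example above
            in_dim,
            out_dims=[Dim(32, name="c1"), Dim(64, name="c2"), Dim(64, name="c3")],
            filter_sizes=[(3, 3), (3, 3), (3, 3)],
            pool_sizes=[(1, 2)],
            strides=[(1, 1), (3, 1), (2, 1)],
        ),
    )
    encoder = ConformerEncoderV2(enc_dim, num_layers=12)

    # V1-equivalent forward pass: the frontend subsamples and projects,
    # the encoder stack then operates on the downsampled spatial dim.
    x, enc_spatial_dim = frontend(source, in_spatial_dim=in_spatial_dim)
    y = encoder(x, spatial_dim=enc_spatial_dim)  # assumed call signature, check __call__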

Parameters:
  • out_dim – the output feature dimension

  • num_layers – the number of encoder layers

  • ff_dim – the dimension of the feed-forward layers. 2048 originally, or 4 times out_dim

  • ff_activation – activation function for feed-forward network

  • dropout – the dropout value for the FF block

  • conv_kernel_size – the kernel size of depthwise convolution in the conv block

  • conv_norm – the normalization used in the conv block. Batch norm in the original paper

  • num_heads – the number of attention heads

  • att_dropout – attention dropout value

  • encoder_layer – an instance of ConformerEncoderLayer or similar

  • encoder_layer_opts – options for the encoder layer

  • sequential – the container type used to chain the encoder layers (default: Sequential)
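For illustration, a construction sketch using some of these options (a sketch only; all values are example choices taken from the signature above, not recommended settings):

    from returnn.tensor import Dim
    from returnn.frontend.encoder.conformer_v2 import ConformerEncoderV2

    enc_dim = Dim(512, name="enc")
    encoder = ConformerEncoderV2(
        enc_dim,
        num_layers=12,
        ff_dim=Dim(4 * enc_dim.dimension, name="ff"),  # 4 times out_dim, cf. ff_dim above
        num_heads=8,
        conv_kernel_size=32,
        dropout=0.1,
        att_dropout=0.1,
    )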