Tensor and Dim


The returnn.tensor.Tensor class wraps a tf.Tensor or torch.Tensor and adds meta information about the tensor and its axes.

This was introduced with the TF backend in 2016. The concept is also explained in the slides of our Interspeech 2020 tutorial about machine learning frameworks including RETURNN.

It is conceptually similar to named tensors / named axes in other frameworks, but goes well beyond that by carrying further meta information about a tensor and its axes. Also, an axis name is not simply a string as in other frameworks, but a returnn.tensor.Dim object.

Specifically, returnn.tensor.Tensor covers the following information:

  • Shape

    • Dimension tags for each axis (returnn.tensor.Dim), see below

    • Specific handling of batch axis

    • Default spatial/time axis

    • Default feature axis

    • Shape itself

  • Sequence lengths (tensor of shape [Batch]) for each variable-length axis (can have multiple variable-length axes)

  • Data type (float, int, string, …)

  • Categorical data flag, i.e. data represents class indices (implies int data type)

    • Number of classes

    • Vocabulary for classes

  • Beam search information (beam scores, beam source indices for traceback) (returnn.tf.util.data.SearchBeam)

  • Flag whether data is available at decoding/inference time

returnn.tensor.Tensor is the main tensor object used in the RETURNN frontend. It is also used everywhere in the TF backend of RETURNN; specifically, the inputs and outputs of layers are returnn.tensor.Tensor objects.

Layers and RETURNN frontend modules and functions are flexible w.r.t. the input format:

  • The order of axes should not matter. Each operation is performed on the logical axis (e.g. returnn.tf.layers.basic.LinearLayer operates on the feature dimension).

  • Any code can potentially change the order of axes for efficiency.

    • [Time,Batch,Feature] is more efficient for RNNs

    • [Batch,Feature,Time] is more efficient for CNNs

    • [Batch,Time,Feature] is the default
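
The axis-order independence described above can be illustrated with a small toy model. This is only a sketch and does not use the RETURNN API: ToyDim, ToyTensor, transpose and reduce_sum are hypothetical names introduced for illustration. It assumes that an operation locates its axis by looking up a tag object, and shows that a reduction over the feature tag then gives the same result for a [Batch,Time,Feature] and a [Time,Batch,Feature] layout.

```python
from itertools import product


class ToyDim:
    """Hypothetical stand-in for a dimension tag (compared by identity, not by name)."""

    def __init__(self, name, size):
        self.name = name
        self.size = size


class ToyTensor:
    """Hypothetical tensor: maps index tuples to values, plus one tag per axis."""

    def __init__(self, dims, values):
        self.dims = list(dims)
        self.values = dict(values)

    def transpose(self, new_dims):
        """Return the same data with the axes reordered to new_dims."""
        perm = [self.dims.index(d) for d in new_dims]
        values = {tuple(idx[p] for p in perm): v for idx, v in self.values.items()}
        return ToyTensor(new_dims, values)

    def reduce_sum(self, dim):
        """Sum over the axis tagged with dim, wherever it sits in the layout."""
        axis = self.dims.index(dim)
        out = {}
        for idx, v in self.values.items():
            rest = idx[:axis] + idx[axis + 1:]
            out[rest] = out.get(rest, 0) + v
        return ToyTensor(self.dims[:axis] + self.dims[axis + 1:], out)


batch, time, feat = ToyDim("batch", 2), ToyDim("time", 3), ToyDim("feature", 4)
dims = [batch, time, feat]
x = ToyTensor(dims, {idx: sum(idx) for idx in product(*(range(d.size) for d in dims))})
x_tbf = x.transpose([time, batch, feat])  # same data in [Time,Batch,Feature] layout

# Reducing over the feature tag gives the same result for both layouts.
y = x.reduce_sum(feat)
y_tbf = x_tbf.reduce_sum(feat).transpose([batch, time])
```

The caller never passes an integer axis index, so code in between is free to reorder the axes for efficiency without changing the result.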


A returnn.tensor.Dim object represents a dimension (axis) of a returnn.tensor.Tensor object. We also refer to this as a dimension tag, as it covers more meta information than just the size.

It stores:

  • Static size, or None representing dynamic sizes

  • (Sequence) lengths in case of dynamic sizes. Usually, these are per batch entry, i.e. of shape [Batch]. However, this is not a requirement, and they can also have any shape. In fact, the dynamic size is again another returnn.tensor.Tensor object.

  • Optionally, a vocabulary

  • Its kind: batch, spatial or feature (although in most cases there is no real difference between spatial or feature)
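
To make the role of the sequence lengths more concrete, the following sketch builds a padding mask from per-batch-entry lengths. It is plain Python and not RETURNN code: sequence_mask is a hypothetical helper, and it assumes the common case described above where the lengths have shape [Batch] and the data is padded to the longest sequence in the batch.

```python
def sequence_mask(seq_lens):
    """Build a [Batch,Time] boolean mask from sequence lengths of shape [Batch].

    True marks a real frame, False marks padding.
    """
    max_len = max(seq_lens, default=0)
    return [[t < n for t in range(max_len)] for n in seq_lens]


seq_lens = [3, 1, 2]  # one length per batch entry
mask = sequence_mask(seq_lens)
```

Because the lengths are stored with the dimension itself, any operation over that axis (such as a softmax or a reduction) can derive such a mask without the user passing it explicitly.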

Many layers allow specifying a custom dimension tag for the output, via out_dim or similar options. See #597.

It is possible to perform elementary algebra on dimension tags, such as addition, subtraction, multiplication and division. These operations are not commutative, i.e. a + b != b + a and a * b != b * a, because the order matters when concatenating (addition) or merging (multiplication) dimensions, and likewise for the inverse operations of splitting features and splitting dimensions. We support equality for simple identities like 2 * a == a + a (but 2 * a != a * 2), (a + b) * c == a * c + b * c, a * b // b == a. See #853. See the test_dim_math_... functions for examples.
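
Why the order matters can be seen on plain Python lists, without any RETURNN code. In this sketch, concat and merge are hypothetical helpers: concat stands for the concatenation behind a + b, and merge for flattening two axes into one behind a * b, under the assumption that the first factor is the major (slower-varying) axis.

```python
def concat(xs, ys):
    """Concatenate the entries of two axes; corresponds to adding their dims."""
    return xs + ys


def merge(xs, ys):
    """Flatten two axes into one, first axis major; corresponds to multiplying dims."""
    return [(x, y) for x in xs for y in ys]


a = ["a0", "a1"]
b = ["b0", "b1", "b2"]

# a + b and b + a have the same size but a different layout.
ab, ba = concat(a, b), concat(b, a)

# a * b and b * a contain the same pairs, but in a different order.
a_times_b = merge(a, b)
b_times_a = [(x, y) for (y, x) in merge(b, a)]  # same (a, b) pair format, b-major order

# 2 * a: the size-2 axis is major, so this is two copies of a, i.e. a + a.
two_times_a = [x for _ in range(2) for x in a]
# a * 2: a is major, so every entry of a is repeated twice.
a_times_two = [x for x in a for _ in range(2)]
```

Here two_times_a equals concat(a, a) while a_times_two does not, which mirrors 2 * a == a + a but 2 * a != a * 2.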

We provide a global batch dim object (returnn.tf.util.data.batch_dim) which can be used to avoid creating a new batch dim object every time, although it does not matter as we treat all batch dims as equal. Any logic regarding the batch dim (such as beam search) is handled separately.

In a user config, the dim tags are usually introduced when defining extern_data. Example:

from returnn.tf.util.data import batch_dim, SpatialDim, FeatureDim

# Variable-length axes are created without a static size.
input_seq_dim = SpatialDim("input-seq-len")
input_feat_dim = FeatureDim("input-feature", 40)
target_seq_dim = SpatialDim("target-seq-len")
target_classes_dim = FeatureDim("target-classes", 1000)

extern_data = {
    # Dense input of shape [Batch,Time,Feature].
    "data": {
        "dim_tags": [batch_dim, input_seq_dim, input_feat_dim]},
    # Sparse targets: class indices of shape [Batch,Time].
    "classes": {
        "dim_tags": [batch_dim, target_seq_dim],
        "sparse_dim": target_classes_dim},
}

All layers which accept an axis or in_dim argument can also be given a dim object instead of a textual description (like "T" or "F"). A dimension tag object is usually more robust than such a textual description and is the recommended way.

You can specify out_shape for any layer to verify the output shape via dimension tags. See #706.

Example usages

See Managing Axes.

returnn.tf.layers.basic.SoftmaxOverSpatialLayer could be used like this:

"att_weights": {"class": "softmax_over_spatial", "from": "energy"}

This would use the default time axis of energy.

Alternatively, the axis can be specified explicitly:

"att_weights": {"class": "softmax_over_spatial", "from": "energy", "axis": "stag:encoder"}

This would use the dimension tag called “encoder”.

returnn.tf.layers.basic.ReduceLayer, example doing max over the encoder time axis:

"output": {"class": "reduce", "axis": "stag:encoder", "mode": "max", "from": "encoder"}


Current shortcomings

  • The logic to define the default time/feature axes can be ambiguous in some (rare, exotic) cases. Thus, when you use "axis": "T" in your code and the tensor has multiple time/spatial axes, it can sometimes lead to unexpected behavior. This might also be a problem for all layers which operate on the feature axis, such as returnn.tf.layers.basic.LinearLayer and many others (although in most cases there is no ambiguity).
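
The kind of ambiguity meant here can be sketched with a toy resolver. This is not how RETURNN determines its default axes; ToyDim and resolve_axis are hypothetical, and the rule that "T" means the first spatial axis is an assumption made only for this example. The point is that a tensor with two spatial axes offers two candidates for "T", while a dimension tag object identifies exactly one axis.

```python
from collections import namedtuple

# Hypothetical stand-in for a dimension tag; not the RETURNN class.
ToyDim = namedtuple("ToyDim", ["name", "kind"])


def resolve_axis(dims, axis):
    """Resolve an axis given either the string "T" or a ToyDim object.

    Assumption for this sketch: "T" means the first spatial axis.
    """
    if axis == "T":
        return next(i for i, d in enumerate(dims) if d.kind == "spatial")
    return dims.index(axis)


batch = ToyDim("batch", "batch")
width = ToyDim("width", "spatial")
height = ToyDim("height", "spatial")
feat = ToyDim("feature", "feature")
dims = [batch, width, height, feat]

# Two axes qualify as "T", so the string alone does not say which one is meant.
candidates = [d.name for d in dims if d.kind == "spatial"]
by_string = dims[resolve_axis(dims, "T")].name

# Passing the tag object selects the intended axis unambiguously.
by_tag = dims[resolve_axis(dims, height)].name
```

A user who meant the height axis but wrote "T" would silently get the width axis in this toy model, which is why passing dimension tag objects is the more robust choice.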