This wraps a tf.Tensor and adds a lot of meta information about the tensor and its axes. All of this is kept in a single class.

This was introduced with the TF backend in 2016. The idea and concept are also explained in the slides of our Interspeech 2020 tutorial about machine learning frameworks including RETURNN.

It is conceptually similar to named tensors / named axes in other frameworks, but goes well beyond that by carrying lots of other meta information about a tensor and its axes. Also, an axis name is not simply a string like in other frameworks, but an object.

Specifically, the information covers:

  • Shape
  • Sequence lengths (tensor of shape [Batch]) for each variable-length axis (can have multiple variable-length axes)
  • Data type (float, int, string, …)
  • Categorical data flag, i.e. data represents class indices (implies int data type)
    • Number of classes
    • Vocabulary for classes
  • Beam search information (beam scores, beam source indices for traceback)
  • Flag whether data is available at decoding/inference time

This class is used everywhere in the TF backend of RETURNN. Specifically, the inputs/outputs of layers are instances of it.
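
To make the list above more concrete, here is a minimal sketch of what such a metadata container could look like. It is not RETURNN's actual class: the names DimTag and TensorMeta and all of their fields are hypothetical and chosen only for illustration, and the wrapped tensor itself is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True, eq=False)
class DimTag:
    """Hypothetical axis object. eq=False keeps identity comparison,
    so two tags with the same description are still different axes."""
    description: str
    size: Optional[int] = None  # None means the size is not static
    is_batch: bool = False


@dataclass
class TensorMeta:
    """Hypothetical container for meta information (not RETURNN's API)."""
    dims: Tuple[DimTag, ...]            # one tag per axis, in storage order
    dtype: str = "float32"
    num_classes: Optional[int] = None   # set for categorical data (class indices)
    vocab: Optional[List[str]] = None   # optional labels for the classes
    available_for_inference: bool = True

    def __post_init__(self):
        # Categorical data represents class indices, so it implies an int dtype.
        if self.num_classes is not None and not self.dtype.startswith("int"):
            raise ValueError("categorical data requires an int dtype")

    @property
    def shape(self) -> Tuple[Optional[int], ...]:
        return tuple(d.size for d in self.dims)

    @property
    def variable_length_axes(self) -> List[int]:
        # Each of these axes would carry its own sequence lengths of shape [Batch].
        return [i for i, d in enumerate(self.dims)
                if d.size is None and not d.is_batch]


batch = DimTag("batch", is_batch=True)
time = DimTag("time")
feat = DimTag("feature", size=40)
meta = TensorMeta(dims=(batch, time, feat))
```

Keeping the axes as objects rather than strings means that a second axis which happens to be described as "time" as well is never confused with the first one.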

Layers are flexible w.r.t. the input format:

  • The order of axes should not matter. The specific operation will be done on the logical axis (e.g. a layer operates on the feature dimension).
  • A layer potentially changes the order of axes for efficiency.
    • [Time,Batch,Feature] is more efficient for RNNs
    • [Batch,Feature,Time] is more efficient for CNNs
    • [Batch,Time,Feature] is the default
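
The idea that a layer addresses a logical axis rather than a fixed position can be sketched in plain Python. The helpers axis_index and permutation_to below are hypothetical and work on tuples of axis names only; a real layer would apply the resulting permutation to the tensor with a transpose.

```python
from typing import Sequence, Tuple


def axis_index(dims: Sequence[str], name: str) -> int:
    """Return the position of the logical axis `name` in the given layout."""
    return list(dims).index(name)


def permutation_to(dims: Sequence[str], target: Sequence[str]) -> Tuple[int, ...]:
    """Return perm such that axis i of the reordered layout is dims[perm[i]]."""
    if sorted(dims) != sorted(target):
        raise ValueError("target must contain the same axes as dims")
    return tuple(list(dims).index(d) for d in target)


default_layout = ("Batch", "Time", "Feature")
cnn_layout = ("Batch", "Feature", "Time")

# A layer that prefers the CNN layout can compute the permutation it needs.
perm = permutation_to(default_layout, cnn_layout)
reordered = tuple(default_layout[i] for i in perm)

# The feature axis is looked up by name, so its position may change freely.
feature_axis_before = axis_index(default_layout, "Feature")
feature_axis_after = axis_index(reordered, "Feature")
```

Because the feature axis is always resolved by name, a following layer does not need to know which layout the previous layer chose.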

Example usages

See Managing Axes. A layer of class softmax_over_spatial could be used like

"att_weights": {"class": "softmax_over_spatial", "from": "energy"}

This would use the default time axis of the energy.


"att_weights": {"class": "softmax_over_spatial", "from": "energy", "axis": "stag:encoder"}

This would use the dimension tag called “encoder”. Another example, using a layer of class reduce to take the max over the encoder time axis:

"output": {"class": "reduce", "axis": "stag:encoder", "mode": "max", "from": "encoder"}

Current shortcomings

  • Currently the matching / identification of dimension tags is done by partial string matching, which is hacky and could potentially lead to bugs. See Managing Axes. In the future, this should probably be made more explicit by using the dimension tag object instance directly.
  • The logic to define the default time/feature axes can be ambiguous in some (rare, exotic) cases. Thus, when you use "axis": "T" in your code, and the tensor has multiple time/spatial axes, it can sometimes lead to unexpected behavior. This might also be a problem for the many layers which operate on the feature dim axis. (Although in most cases, there is no ambiguity about it…)
  • There are sometimes cases where layers depend on the order of the axes. Examples:
    • For layers that take a kernel shape, the order of the spatial axes matters. You define a kernel shape, and the first entry corresponds to the first spatial axis, etc.
    • When merging axes, the order of the merged axes matters. (Unless you specify the option keep_order, in which case the input order does not matter, and only the order of what is specified in the config matters.)
  • New dim tags are currently created in the __init__ of a layer, but they should be created (uniquely) by get_out_data_from_opts.
  • Static dimensions are not consistently handled via dim tags yet.
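
The first shortcoming above can be illustrated with a small sketch. The functions find_axis_by_tag and find_axis_by_instance and the DimTag class below are hypothetical; they only mimic resolving a "stag:…" specification by substring match against tag descriptions and are not RETURNN's actual matching logic. The sketch shows how a partial match becomes ambiguous once two descriptions share a substring, and how matching by object identity avoids that.

```python
from typing import Sequence


class DimTag:
    """Hypothetical dimension tag; instances are compared by identity."""

    def __init__(self, description: str):
        self.description = description


def find_axis_by_tag(tags: Sequence[DimTag], spec: str) -> int:
    """Resolve a spec like "stag:encoder" by substring match on the description."""
    prefix = "stag:"
    if not spec.startswith(prefix):
        raise ValueError("expected a spec of the form 'stag:<name>'")
    name = spec[len(prefix):]
    matches = [i for i, tag in enumerate(tags) if name in tag.description]
    if len(matches) != 1:
        raise ValueError("%r matches %d axes" % (name, len(matches)))
    return matches[0]


def find_axis_by_instance(tags: Sequence[DimTag], tag: DimTag) -> int:
    """Resolve an axis by the tag object itself, which cannot be ambiguous."""
    for i, candidate in enumerate(tags):
        if candidate is tag:
            return i
    raise ValueError("tag is not an axis of this tensor")


batch = DimTag("batch")
encoder = DimTag("encoder")
encoder_down = DimTag("encoder-downsampled")  # shares the substring "encoder"
tags = [batch, encoder, encoder_down]

try:
    find_axis_by_tag(tags, "stag:encoder")
    ambiguous = False
except ValueError:
    ambiguous = True  # both "encoder" and "encoder-downsampled" match

axis = find_axis_by_instance(tags, encoder)
```

In this sketch the ambiguous case raises an error; a lookup that silently picked the first match instead would be exactly the kind of bug the bullet point warns about.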