Tensor and Dim¶
Tensor¶
This wraps a tf.Tensor or torch.Tensor, adding a lot of meta information about the tensor and its axes.
This is all in the returnn.tensor.Tensor class.
This was introduced with the TF backend in 2016. The idea and concept are also explained in the slides of our Interspeech 2020 tutorial about machine learning frameworks including RETURNN.
It is conceptually similar to named tensors / named axes in other frameworks, but goes well beyond that by having lots of other meta information about a tensor and its axes.
Also, an axis name is not simply a string like in other frameworks, but a returnn.tensor.Dim object.
Specifically, the information returnn.tensor.Tensor covers:
Shape
  Dimension tags for each axis (returnn.tensor.Dim), see below
  Specific handling of batch axis
  Default spatial/time axis
  Default feature axis
  Shape itself
Sequence lengths (tensor of shape [Batch]) for each variable-length axis (can have multiple variable-length axes)
Data type (float, int, string, …)
Categorical data flag, i.e. data represents class indices (implies int data type)
  Number of classes
  Vocabulary for classes
Beam search information (beam scores, beam source indices for traceback) (returnn.tf.util.data.SearchBeam)
Flag whether data is available at decoding/inference time
returnn.tensor.Tensor is the main tensor object used in the RETURNN frontend.
returnn.tensor.Tensor is also used everywhere in the TF backend of RETURNN.
Specifically, the inputs/outputs of layers are returnn.tensor.Tensor objects.
Layers and RETURNN frontend modules and functions are flexible w.r.t. the input format:
The order of axes should not matter. The specific operation will be done on the logical axis (e.g. returnn.tf.layers.basic.LinearLayer operates on the feature dimension).
Any code can potentially change the order of axes for efficiency.
  [Time,Batch,Feature] is more efficient for RNNs
  [Batch,Feature,Time] is more efficient for CNNs
  [Batch,Time,Feature] is the default
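To illustrate what operating on a logical axis means, here is a minimal pure-Python sketch. It is not RETURNN code: the helper `reduce_max_over` and the plain string axis names are hypothetical stand-ins. The reduction is addressed by axis name, so the result is the same for both storage layouts.

```python
def reduce_max_over(data, dims, axis):
    """Max-reduce a 2D nested list over the axis named `axis`.

    `dims` names the two axes of `data` in storage order,
    e.g. ("time", "feature"). Hypothetical helper for illustration only.
    """
    if axis not in dims:
        raise ValueError(f"unknown axis {axis!r}, have {dims}")
    if dims.index(axis) == 0:
        # Reduce over the outer axis: take the max of each column.
        return [max(col) for col in zip(*data)]
    # Reduce over the inner axis: take the max of each row.
    return [max(row) for row in data]


# The same logical tensor stored in two different layouts.
time_major = [[1, 5], [4, 2], [3, 6]]                 # [time=3, feature=2]
feature_major = [list(c) for c in zip(*time_major)]   # [feature=2, time=3]

res_time_major = reduce_max_over(time_major, ("time", "feature"), "time")
res_feature_major = reduce_max_over(feature_major, ("feature", "time"), "time")
print(res_time_major, res_feature_major)  # both are [4, 6]
```

The caller never passes an integer axis index, which is what allows the layout to change without changing the result.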
Dim¶
A returnn.tensor.Dim object represents a dimension (axis) of a returnn.tensor.Tensor object.
We also refer to this as a dimension tag, as it covers more meta information than just the size.
It stores:
Static size, or None representing dynamic sizes
(Sequence) lengths in case of dynamic sizes. Usually, these are per batch entry, i.e. of shape [Batch]. However, this is not a requirement, and they can have any shape. In fact, the dynamic size is itself another returnn.tensor.Tensor object.
Optionally, some vocabulary
Its kind: batch, spatial or feature (although in most cases there is no real difference between spatial and feature)
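As a small illustration of the common case where the dynamic size holds one length per batch entry (shape [Batch]), the following pure-Python sketch derives a [Batch, Time] validity mask from such lengths. The function name `sequence_mask` and the nested-list representation are assumptions made for this example, not part of the RETURNN API.

```python
def sequence_mask(seq_lens, max_len=None):
    """Build a [Batch, Time] boolean mask from per-entry sequence lengths.

    mask[b][t] is True if frame t is a real (non-padded) frame of entry b.
    Hypothetical helper for illustration only.
    """
    if max_len is None:
        max_len = max(seq_lens, default=0)
    return [[t < n for t in range(max_len)] for n in seq_lens]


seq_lens = [3, 1, 2]  # one length per batch entry, i.e. shape [Batch]
mask = sequence_mask(seq_lens)
for row in mask:
    print(row)
```

Any operation over the dynamic axis (reductions, softmax, loss averaging) can use such a mask to ignore padded frames.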
Many layers allow specifying a custom dimension tag as output, via out_dim or similar options.
See #597.
It is possible to perform elementary algebra on dimension tags, such as addition, subtraction, multiplication and division.
These operations are not commutative, i.e. a + b != b + a and a * b != b * a, because the order matters when concatenating and merging dimensions, and likewise for the inverse operations of splitting features and splitting dimensions.
We support equality for simple identities like 2 * a == a + a (but 2 * a != a * 2), (a + b) * c == a * c + b * c, a * b // b == a.
See #853.
See the test_dim_math_... functions for examples.
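The reason order matters can be seen on plain data. The sketch below uses only Python lists and no RETURNN objects. Which memory layout RETURNN associates with a * b versus b * a is not specified here, so the variable names only indicate one possible convention.

```python
# Concatenation: "a + b" versus "b + a".
feat_a = ["a0", "a1"]          # dimension a, size 2
feat_b = ["b0", "b1", "b2"]    # dimension b, size 3
concat_ab = feat_a + feat_b    # a entries first
concat_ba = feat_b + feat_a    # b entries first

# Merging two axes into one: "a * b" versus "b * a".
x = [[f"a{i}b{j}" for j in range(3)] for i in range(2)]     # axes [a=2, b=3]
merge_ab = [x[i][j] for i in range(2) for j in range(3)]    # a is the slow index
merge_ba = [x[i][j] for j in range(3) for i in range(2)]    # b is the slow index

print(concat_ab, concat_ba)
print(merge_ab)
print(merge_ba)
```

In both cases the resulting axis has the same size and the same set of entries, but the entries sit at different positions, so the two results are different dimensions.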
We provide a global batch dim object (returnn.tf.util.data.batch_dim) which can be used to avoid creating a new batch dim object every time, although it does not matter as we treat all batch dims as equal.
Any logic regarding the batch dim (such as beam search) is handled separately.
In a user config, the dim tags are usually introduced already for extern_data.
Example:
from returnn.tf.util.data import batch_dim, SpatialDim, FeatureDim

# Dynamic (variable-length) sequence axes have no static size.
input_seq_dim = SpatialDim("input-seq-len")
input_feat_dim = FeatureDim("input-feature", 40)
target_seq_dim = SpatialDim("target-seq-len")
target_classes_dim = FeatureDim("target-classes", 1000)

extern_data = {
    # Dense input features of shape [Batch, Time, Feature].
    "data": {
        "dim_tags": [batch_dim, input_seq_dim, input_feat_dim]},
    # Sparse targets (class indices) of shape [Batch, Time].
    "classes": {
        "dim_tags": [batch_dim, target_seq_dim],
        "sparse_dim": target_classes_dim},
}
All layers which accept an axis or in_dim argument can also be given a dim object instead of a text description (like "T" or "F").
A dimension tag object is usually more robust than relying on such a textual description and is the recommended way.
You can specify out_shape for any layer to verify the output shape via dimension tags.
See #706.
Example usages¶
See Managing Axes.
returnn.tf.layers.basic.SoftmaxOverSpatialLayer could be used like this:
"att_weights": {"class": "softmax_over_spatial", "from": "energy"}
This would use the default time axis of the energy.
Or:
"att_weights": {"class": "softmax_over_spatial", "from": "energy", "axis": "stag:encoder"}
This would use the dimension tag called “encoder”.
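To make the idea of a softmax over a variable-length spatial axis concrete, here is a pure-Python sketch. It is not the layer's implementation. It assumes that frames beyond the sequence length should receive zero weight, and the function name `softmax_over_time` and the nested-list layout [batch][time] are hypothetical.

```python
import math


def softmax_over_time(energy, seq_lens):
    """Softmax over the time axis of `energy` ([batch][time] nested list).

    Only the first seq_lens[b] frames of entry b are normalized; padded
    frames get weight 0.0. Each length must be at least 1.
    Hypothetical helper for illustration only.
    """
    out = []
    for row, n in zip(energy, seq_lens):
        valid = row[:n]
        m = max(valid)  # subtract the max for numerical stability
        exps = [math.exp(e - m) for e in valid]
        total = sum(exps)
        out.append([v / total for v in exps] + [0.0] * (len(row) - n))
    return out


energy = [[0.0, 0.0, 0.0], [1.0, 9.0, 9.0]]
weights = softmax_over_time(energy, seq_lens=[3, 1])
print(weights)
```

In the second batch entry only the first frame is valid, so it receives the full weight regardless of the large values in the padded frames.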
returnn.tf.layers.basic.ReduceLayer, example doing max over the encoder time axis:
"output": {"class": "reduce", "axis": "stag:encoder", "mode": "max", "from": "encoder"}
Current shortcomings¶
The logic to define the default time/feature axes can be ambiguous in some (rare, exotic) cases. Thus, when you use "axis": "T" in your code, and the tensor has multiple time/spatial axes, it can sometimes lead to unexpected behavior. This might also be a problem for all layers which operate on the feature axis, such as returnn.tf.layers.basic.LinearLayer and many others. (Although in most cases, there is no ambiguity about it…)
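The following toy sketch shows why a textual axis description can be ambiguous while a dim object is not. It does not reproduce RETURNN's actual heuristics: `ToyDim` and `resolve_axis` are hypothetical, and the rule "take the first spatial axis" is only an assumed stand-in for a default-axis heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDim:
    """Hypothetical stand-in for a dimension tag."""
    name: str
    kind: str  # "batch", "spatial" or "feature"


def resolve_axis(dims, axis):
    """Return the index in `dims` for `axis` (the string "T" or a ToyDim)."""
    if isinstance(axis, ToyDim):
        return dims.index(axis)  # unambiguous: exactly this dim
    if axis == "T":
        spatial = [i for i, d in enumerate(dims) if d.kind == "spatial"]
        if not spatial:
            raise ValueError("no spatial axis")
        return spatial[0]  # assumed heuristic: silently pick the first one
    raise ValueError(f"unsupported axis description {axis!r}")


batch = ToyDim("batch", "batch")
height = ToyDim("height", "spatial")
width = ToyDim("width", "spatial")
feat = ToyDim("feature", "feature")

layout_1 = [batch, height, width, feat]
layout_2 = [batch, width, height, feat]

# "T" selects a different logical axis depending on the layout.
print(layout_1[resolve_axis(layout_1, "T")].name)
print(layout_2[resolve_axis(layout_2, "T")].name)
# A dim object always selects the same logical axis.
print(layout_1[resolve_axis(layout_1, height)].name)
print(layout_2[resolve_axis(layout_2, height)].name)
```

With two spatial axes, the string-based lookup depends on an implicit rule, whereas passing the dim object states the intent explicitly.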