Welcome to RETURNN
RETURNN paper 2016, RETURNN paper 2018.
RETURNN (RWTH extensible training framework for universal recurrent neural networks) is a Theano/TensorFlow-based implementation of modern recurrent neural network architectures. It is optimized for fast and reliable training of recurrent neural networks in a multi-GPU environment.
Features include:
- Mini-batch training of feed-forward neural networks
- Sequence-chunking based batch training for recurrent neural networks
- Long short-term memory recurrent neural networks including our own fast CUDA kernel
- Multi-dimensional LSTM (GPU only, there is no CPU version)
- Memory management for large data sets
- Work distribution across multiple devices
- Flexible and fast architecture which allows all kinds of encoder-attention-decoder models
See Basic Usage and Technological Overview.
Here is the video recording of a RETURNN overview talk (slides, exercise sheet; hosted by eBay).
There are many example demos which work on artificially generated data, i.e. they should work as-is.
There are some real-world examples such as setups for speech recognition on the Switchboard or LibriSpeech corpus.
Some benchmark setups against other frameworks can be found here. The results are in the RETURNN paper 2016. Performance benchmarks of our LSTM kernel vs CuDNN and other TensorFlow kernels are in TensorFlow LSTM Benchmark.
There is also a wiki. Questions can also be asked on StackOverflow using the RETURNN tag.
Some recent development changelog can be seen here.
Technological Overview
RETURNN is a machine learning toolkit that can be used as standalone application or framework for training and running sequential neural network architectures. The main tasks of RETURNN are:
- Network construction via nested dictionaries
- Data loading with predefined and extendable dataset objects
- Automatic management of layer outputs (such as tensor axes and time dimensions) with a Data object
- Support of dynamic training schemes that allow for network structure and parameter changes during training
- Managing the losses and optimizer functions
- Learning rate scheduling based on training scores
RETURNN supports two calculation backends: TensorFlow and Theano. It is recommended to stick to the TensorFlow backend, as Theano is no longer supported.
RETURNN is mostly used as a tool, where rnn.py is the main entry point, but you can also use it as a framework / Python module in your own Python code.
To get an idea about how it works, it helps to roughly follow the execution path starting in rnn, especially in rnn.main().
In all cases, the code itself should be checked for details and comments.
Structure
Many components are implemented separately for both Theano and TensorFlow:

- The engine for high-level logic, although a bit is shared: Engine, EngineTask for Theano and TFEngine for TensorFlow. For TensorFlow, the engine contains the high-level methods for training, forward pass, and other executed tasks. It keeps track of the network, devices, models and the updater function, and is the main connection between all these components. TFEngine also contains the TFEngine.Runner, which is responsible for managing the TensorFlow session.
- Network topology construction, which constructs the computation graph for training or just forwarding: Network, TFNetwork.
- Network model update code for training, i.e. SGD etc.: Updater, TFUpdater.
- All the individual layer implementations: NetworkLayer, NetworkBaseLayer, NetworkHiddenLayer, NetworkRecurrentLayer etc. for Theano and TFNetworkLayer, TFNetworkRecLayer for TensorFlow. This also means that Theano and TensorFlow don't support the same layers, and even parameters can be different.
- Some utilities: TheanoUtil and TFUtil, which contains the returnn.tf.util.data.Data class.
- Multi-GPU logic: Device, EngineTask for Theano; not yet implemented for TensorFlow.
All the rest is shared for all backends, which mostly is:
- The main entry point: rnn.
- Config handling: Config.
- Logging: Log.
- Utilities: Util.
- Dataset reading: Dataset, including all the different dataset implementations HDFDataset, SprintDataset, LmDataset, GeneratingDataset, MetaDataset, etc.
- Learning rate scheduling logic such as Newbob: LearningRateControl.
- Pretrain network structure construction: Pretrain.
- The native op code, which generates code for ops for both CUDA and CPU, shares a common base: NativeOp, where TensorFlow-specific code is in TFNativeOp.
Execution guide
- rnn.main() will parse command line arguments and read in a config.
- Then logging (Log) is initialized, based on verbosity and other settings.
- Then it initializes the datasets (train, dev, eval in config), i.e. Dataset instances.
- Theano only: Device instances.
- The engine, i.e. an Engine or TFEngine instance.
- Depending on the task option, some engine initialization, which also initializes the network computation graph (see Network Construction).
- Then, depending on the task option, it might start engine.train, engine.forward etc. (Engine.Engine.train() or TFEngine.Engine.train()), see Training.
Network Construction
The network structure, which defines the model topology, is specified by the network config option.
This is a dict, where each entry is a layer specification, which itself is a dict containing
the kwargs for the specific layer class. E.g.:
network = {
"fw1": {"class": "linear", "activation": "relu", "dropout": 0.1, "n_out": 500},
"fw2": {"class": "linear", "activation": "relu", "dropout": 0.1, "n_out": 500, "from": ["fw1"]},
"output": {"class": "softmax", "loss": "ce", "from": ["fw2"]}
}
The "class" key will get extracted from the layer arguments, and the specific layer class will be used.
For Theano, the base layer classes are NetworkBaseLayer.Container and NetworkBaseLayer.Layer;
for TensorFlow, it is returnn.tf.layers.base.LayerBase.
E.g. the example above would use the TFNetworkLayer.LinearLayer class,
and LinearLayer.__init__ will accept arguments like activation.
In the given example, all the remaining arguments will get handled by the base layer.
The construction itself can be found for TensorFlow in returnn.tf.network.TFNetwork.construct_from_dict(),
which starts from the output layers and goes over the sources of each layer, which are defined by "from".
If a layer does not define "from", it will automatically get its input from the dataset data.
Here is a 2-layer unidirectional LSTM network:
network = {
"lstm1": {"class": "rec", "unit": "lstm", "dropout": 0.1, "n_out": 500},
"lstm2": {"class": "rec", "unit": "lstm", "dropout": 0.1, "n_out": 500, "from": ["lstm1"]},
"output": {"class": "softmax", "loss": "ce", "from": ["lstm2"]}
}
In TensorFlow, that would use the layer class TFNetworkRecLayer.RecLayer, which will handle the argument unit.
Training
The engine will loop over the epochs and the individual batches / steps and loads and saves the model.
The specific implementation is different in Theano and TensorFlow.
See the code for more details, i.e. Engine, EngineTask for Theano and TFEngine for TensorFlow.
Installation
Installation is easy. Check out the Git repository of RETURNN (https://github.com/rwth-i6/returnn/). Install all dependencies, which are just numpy, h5py, and the backend you want to use (TensorFlow or Theano). You can do so via:
pip install -r requirements.txt
You probably want to use pip3 instead of pip,
and you also might want to add the option --user (if you are not using virtualenv).
For TensorFlow, use pip install tensorflow-gpu (pip3 install --user tensorflow-gpu)
if you have an Nvidia GPU,
and make sure that CUDA and cuDNN can be found.
For Theano, only version 0.9 works (pip install theano==0.9).
For Theano usage, make sure that you have this in your ~/.theanorc
:
[global]
device = cpu
floatX = float32
For some specific datasets or special layers, additional dependencies might be needed, such as librosa.
For running the tests, you need nose.
You can also install RETURNN as a framework, via pip (PyPI entry), like:
pip install returnn
See Basic Usage for the basic usage of RETURNN.
Basic Usage
Install RETURNN, see Installation.
Now rnn.py is the main entry point. Usage:
./rnn.py <config-file> [other-params]
where configfile
is a config file for RETURNN.
See here for an example,
and many more examples from the demos.
The configuration syntax can be in three different forms:
- a simple line-based file with key value pairs
- a JSON file (determined by a "{" at the beginning of the file)
- executable Python code (determined by a "#!" at the beginning of the file)
Config files using the Python code syntax are the de-facto standard for all current examples and setups. The parameters can be set by defining global variables, but it is possible to use any form of Python code such as functions and classes to construct your network or fill in global variables based on more complex decisions. The Python syntax config files may also contain additional code such as layer or dataset definitions.
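To make this concrete, here is a minimal sketch of what a Python-syntax config file might look like. All file paths, layer sizes and hyperparameter values below are made up for illustration; the parameters themselves (task, train, extern_data, network, etc.) are explained in the list further down.

```python
#!rnn.py
# Hypothetical minimal RETURNN config in Python syntax (values made up).
use_tensorflow = True
task = "train"

# Dataset definitions (paths are placeholders):
train = {"class": "HDFDataset", "files": ["train.hdf"]}
dev = {"class": "HDFDataset", "files": ["dev.hdf"]}

# 100-dim dense input, 5000 sparse target classes:
extern_data = {"data": [100, 2], "classes": [5000, 1]}

# Arbitrary Python code can build the network dict:
num_layers = 2
network = {}
src = "data"
for i in range(1, num_layers + 1):
    name = "fw%i" % i
    network[name] = {"class": "linear", "activation": "relu", "n_out": 500, "from": [src]}
    src = name
network["output"] = {"class": "softmax", "loss": "ce", "from": [src]}

# Training hyperparameters (made up):
batch_size = 5000
max_seqs = 40
learning_rate = 0.01
num_epochs = 100
model = "/tmp/returnn-model/network"
log_verbosity = 3
```

Since the config is executed as Python, loops and helper functions like the one above are a common way to build repetitive network structures.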
When called, rnn.py will execute some task, such as train, forward or search.
The task train will train a model specified by the given network structure.
After each epoch of training on the provided training data, the current parameters will be stored to a model checkpoint file.
Besides the training data, a development dataset is used to evaluate the current model, and the evaluation results are stored in a separate file.
The task forward will run a forward pass of the network, given an evaluation dataset, and store the results in an HDF file.
The task search is used to run the network with the beam search algorithm.
The results are serialized into text form and stored in a plain text file in Python dictionary format.
The following parameters are very common, and are used in most RETURNN config files:
- task: The task, such as train, forward or search.

- device: E.g. gpu or cpu. Although RETURNN will automatically detect and use a GPU if available, a specific device can be enforced by setting this parameter.

- use_tensorflow: If you set this to True, TensorFlow will be used; otherwise, the installed backend is used. If both backends are installed (TensorFlow and Theano), RETURNN will use Theano as default for legacy reasons.

- train / dev / eval: The dataset parameters are set to a Python dict with a mandatory entry class. The class attribute needs to be set to the class name of the dataset that should be used. An overview of the available datasets can be found here. train and dev are used during training, while eval is usually used to define the dataset for the forward or search task. Besides passing the constructor parameters to the specific dataset, there are some common parameters, such as seq_ordering, which defines the order of the sequences provided by the dataset. Possible values are:
  - default: keep the sequences as-is
  - reverse: use the default sequences in reversed order
  - random: shuffle the data with a predefined fixed seed
  - random:<seed>: shuffle the data with the given seed
  - sorted: sort by length (only if available), beginning with the shortest sequences
  - sorted_reverse: sort by length, beginning with the longest sequences
  - laplace:<n_buckets>: sort by length with n laplacian buckets (one bucket means going from shortest to longest and back with 1/n of the data)
  - laplace:.<n_sequences>: sort by length with n sequences per laplacian bucket
  Note that not all sequence order modes are available for all datasets, and some datasets may provide additional modes.

- extern_data: Defines the source/target dimensions of the data. Both can be integers. extern_data can also be a dict if your dataset has other data streams. The standard source data is called "data" by default, and the standard target data is called "classes" by default. You can also specify whether your data is dense or sparse (i.e. it is just the index), which is specified by the number of dimensions, i.e. 2 (time-dim + feature-dim) or 1 (just time-dim). When using no explicit definition, it is assumed that the data contains a time axis.
  Example: extern_data = {"data": [100, 2], "classes": [5000, 1]}. This defines an input dimension of 100, where the input is dense (2), and an output dimension of 5000, where the output provided by the dataset is sparse (1).
  For a more explicit definition of the shapes, you can provide a dict instead of a list or tuple. This dict may contain information to create Data objects. For extern_data, only dim and shape are required.
  Example: 'feature_data': {'dim': 80, 'shape': (None, 80)}. This defines 80-dimensional features with a time axis of arbitrary length.
  Example: 'speaker_classes': {'dim': 1172, 'shape': (), 'sparse': True}. This defines a sparse input, e.g. for speaker classes that do not have a time axis.
  In general, all input parameters to Data can be provided.

- network: This is a dict which defines the network topology. It consists of layer names as strings, mapped on dicts, which define the layers. The layer dict consists of keys as strings, and the value type depends on the key. The layer dict should contain the key class, which defines the class or type of the layer, such as hidden for a feed-forward layer, rec for a recurrent layer (including LSTM) or softmax for the output layer (doesn't need to have the softmax activation). Usually it also contains the key n_out, which defines the feature dimension of the output of this layer, and the key from, which defines the inputs to this layer, which is a list of other layers. If you omit from, it will automatically pass in the input data from the dataset. All layer dict keys are passed to the layer class __init__, so you have to refer to the code for all details.
  Example of a 3-layer bidirectional LSTM:
  network = {
  "lstm0_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1 },
  "lstm0_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1 },
  "lstm1_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm0_fw", "lstm0_bw"] },
  "lstm1_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm0_fw", "lstm0_bw"] },
  "lstm2_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm1_fw", "lstm1_bw"] },
  "lstm2_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm1_fw", "lstm1_bw"] },
  "output" : { "class" : "softmax", "loss" : "ce", "from" : ["lstm2_fw", "lstm2_bw"] }
  }
  See API or the code itself for documentation of the arguments for each layer class type. The rec layer class in particular supports a wide range of arguments, and several units which can be used; e.g. you can choose between different LSTM implementations, or GRU, or standard RNN, etc. See TFNetworkRecLayer.RecLayer or NetworkRecurrentLayer.RecurrentUnitLayer. See also TensorFlow LSTM Benchmark.

- batch_size: The total number of frames. A mini-batch has at least a time dimension and a batch dimension (or sequence dimension), and depending on dense or sparse, also a feature dimension. batch_size is the upper limit for time * sequences during creation of the mini-batches.

- max_seqs: The maximum number of sequences in one mini-batch.

- learning_rate: The learning rate during training, e.g. 0.01.

- adam / nadam / ...: E.g. set adam = True to enable the Adam optimization during training. See Updater.py for many more.

- model: Defines the model file where RETURNN will save all model params after an epoch of training. For each epoch, it will suffix the filename by the epoch number. When running forward or search, the specified model will be loaded. The epoch can then be selected with the parameter load_epoch.

- num_epochs: The number of epochs to train.

- log_verbosity: An integer. Common values are 3 or 4. Starting with 5, you will get an output per mini-batch.
There are many more parameters, and more details to many of the listed ones. Details on the parameters can be found in the parameter reference. As the reference is still incomplete, please watch out for additional parameters that can be found in the code.
All configuration params can also be passed as command line parameters.
The generic form is ++param value, but more options are available.
Please see the code for some usage examples.
See Technological Overview for more details and an overview how it all works.
Network Structure
Construction
The network structure, which defines the model topology, is specified by the network config option.
This is a dict, where each entry is a layer specification, which itself is a dict containing
the kwargs for the specific layer class. E.g.:
network = {
"fw1": {"class": "linear", "activation": "relu", "dropout": 0.1, "n_out": 500, "from": ["data"]},
"fw2": {"class": "linear", "activation": "relu", "dropout": 0.1, "n_out": 500, "from": ["fw1"]},
"output": {"class": "softmax", "loss": "ce", "from": ["fw2"], "target": "classes"}
}
The "class" key will get extracted from the layer arguments, and the specific layer class will be used.
Some arguments are available for all layer classes, such as dropout.
A list of all general arguments can be found below in Defining Layers.
For the layer-specific arguments, such as activation for the linear layer, please have a look at the layer reference.
The from argument, which is also available for all layers, is a list of all input layers or datasets.
"data" denotes the default data input.
More details on how to connect layers and datasets can be found below at Connecting Layers.
For Theano, the base layer classes are NetworkBaseLayer.Container and NetworkBaseLayer.Layer;
for TensorFlow, it is returnn.tf.layers.base.LayerBase.
E.g. the example above would use the TFNetworkLayer.LinearLayer class,
and LinearLayer.__init__ will accept arguments like activation.
In the given example, all the remaining arguments will get handled by the base layer.
The construction itself can be found for TensorFlow in returnn.tf.network.TFNetwork.construct_from_dict(),
which starts from the output layers and goes over the sources of each layer, which are defined by "from".
If a layer does not define "from", it will automatically get its input from the dataset data.
Here is a 2-layer unidirectional LSTM network:
network = {
"lstm1": {"class": "rec", "unit": "lstm", "dropout": 0.1, "n_out": 500, "from": ["data"]},
"lstm2": {"class": "rec", "unit": "lstm", "dropout": 0.1, "n_out": 500, "from": ["lstm1"]},
"output": {"class": "softmax", "loss": "ce", "from": ["lstm2"], "target": "classes"}
}
In TensorFlow, that would use the layer class TFNetworkRecLayer.RecLayer, which will handle the argument unit.
And here is a 3-layer bidirectional LSTM network:
network = {
"lstm0_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1 },
"lstm0_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1 },
"lstm1_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm0_fw", "lstm0_bw"] },
"lstm1_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm0_fw", "lstm0_bw"] },
"lstm2_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm1_fw", "lstm1_bw"] },
"lstm2_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm1_fw", "lstm1_bw"] },
"output" : { "class" : "softmax", "loss" : "ce", "from" : ["lstm2_fw", "lstm2_bw"] }
}
Defining Layers
Every usable layer with the TensorFlow backend inherits from returnn.tf.layers.base.LayerBase.
This class provides most of the parameters that can be set for each layer.
Every layer accepts the following dictionary entries:
- class [str]: Specifies the type of the layer. Each layer class defines a layer_class attribute which defines the layer name.
- from [list[str]]: Specifies the inputs of a layer, usually referring to layer names. Many layers automatically concatenate their inputs, as provided by TFNetworkLayer._ConcatInputLayer. For more details on how to connect layers, see Connecting Layers.
- n_out [int]: Specifies the output feature dimension, and is usually set for every layer, but the argument is not strictly required. If n_out is not specified or set to None, RETURNN will try to determine the output size from a provided target. If a loss is given, it will set n_out to the value provided by returnn.tf.layers.base.Loss.get_auto_output_layer_dim().
- out_type [dict[str]]: Specifies the output shape in more detail. The keys are dim and shape. If output is specified, the values are used to check whether the output matches the given dimension and shape; otherwise, it is passed to returnn.tf.layers.base.LayerBase.get_out_data_from_opts().
- loss [str]: Every layer can have its output connected to a loss function. For available loss functions, see Loss Functions. When specifying a loss, target also has to be set (see below). In addition, loss_scale (defaults to 1) and loss_opts can be specified.
- target [str]: Specifies the loss target in the dataset. If the target is not part of extern_data but another layer in the network, add "layer:" as prefix.
- loss_scale [float]: Specifies a loss scale. Before adding all losses, this factor will be used for scaling.
- loss_opts [dict]: Specifies additional loss arguments. For details, see the documentation of the loss functions (Loss Functions).
- loss_only_on_non_search [bool]: Specifies that the loss should not be calculated during search.
- trainable [bool]: (default True) If set to False, the layer parameters will not be updated during training (parameter freezing).
- L2 [float]: If specified, add the L2 norm of the parameters with the given factor to the total constraints.
- darc1 [float]: If specified, add the darc1 loss of the parameters with the given factor to the total constraints.
- dropout [float]: If specified, applies dropout to the input of the layer.
- spatial_smoothing [float]: If specified, add the spatial-smoothing loss of the layer output with the given factor to the total constraints.
- register_as_extern_data [str]: Register the output of the layer as an accessible entry of extern_data.
Connecting Layers
In most cases it is sufficient to just specify a list of layer names for the from attribute. When no input is specified,
it will automatically fall back to "data", which is the default input data of the provided dataset. Depending on the
definition of the feature and target keys (see Dataset.DatasetSeq), the data can be accessed
via from: ["data:DATA_KEY"]. When specifying layers inside a recurrent unit (see Recurrent Layers), two additional
input prefixes are available, base and prev. When trying to access layers from outside the recurrent unit, the prefix
base has to be used; otherwise, only other layers inside the recurrent unit are recognised. prev can be used to access
the layer output from the previous recurrent step (e.g. for target embedding feedback).
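The different input prefixes can be sketched as follows. All layer and data names here are invented for illustration:

```python
# Hypothetical sketch of the "from" prefixes.
network_fragment = {
    # "data:<key>" reads a non-default data stream from the dataset:
    "target_embed_in": {"class": "copy", "from": ["data:classes"]},
    "decoder": {"class": "rec", "from": ["data"], "unit": {
        # "prev:" reads this layer's output from the previous recurrent step,
        # "base:" reads a layer defined outside the recurrent unit:
        "output": {"class": "linear", "activation": "relu", "n_out": 128,
                   "from": ["prev:output", "base:target_embed_in"]},
    }},
}
```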
Layer Initialization
RETURNN offers multiple methods of initializing layers. This is usually done by setting the parameter
"forward_weights_init"
in layers that have trainable parameters.
The methods for initializations include, but are not limited to:
- providing a single value (will map to tf.initializers.constant)
- providing the (lowercase) name of a given TensorFlow initializer, which can be e.g.: "glorot_normal", "glorot_uniform", "orthogonal"
- providing a dictionary for the initializer classes, for example:
  "forward_weights_init": {'class': 'VarianceScaling', 'scale': 0.5, 'mode': 'fan_out'}
The initialization is performed in TFUtil.get_initializer().
Note: the initializers can be accessed both as e.g. "glorot_normal" or "glorot_normal_initializer".
Managing Axes
In the default case, the axes of data that is passed between layers (such as batch, time, spatial and feature)
are not visible to the user, and handled by RETURNN internally with the help of returnn.tf.util.data.Data
objects.
For layers that operate on specific axes, meaning they have an axis or axes parameter, different identifiers
(strings) can be used to select the correct axes. These identifiers are:
- B, batch: select the batch axis
- T, time: select the time axis
- F, feature: select the feature axis
- spatial: select all spatial axes (not batch and not feature, needs dynamic length)
- S:<int>, spatial:<int>: select a single spatial axis from the list of all spatial axes (zero-based, can be negative)
- dyn, dynamic: select all dynamic axes (all spatial axes and time, even if it has no dynamic length)
- D:<int>, dyn:<int>, dynamic:<int>: select a specific dynamic axis (zero-based, can be negative)
- T?: select the time axis if existing, none otherwise
- spatial_except_time: select all spatial axes except the time axis
- except_time: select all axes except the time and batch axis
- except_batch: select all axes except the batch axis
Note that all identifiers can be used case-insensitively.
For the axes parameter it is also possible to provide a tuple or list of the above identifiers.
For debugging purposes it is also possible to use an integer to directly access an axis,
but this should not be used in finished configurations.
If something is unclear, or not working as intended, please refer to Data.get_axes_from_description().
Data Input/Output
The parameters that are used to correctly define the data inputs are the three dataset variables train, dev and eval, as well as the parameter extern_data to define the data shapes.
The dataset variables are set to a dictionary structure, where the key "class" defines which class implementation to load, and the other entries are passed as parameters to the constructor of the respective dataset implementation.
A list of the available datasets can be found here.
A very simple example would be:
train = {'class': 'HDFDataset', 'files': ['path/to/training_data.hdf']}
Most datasets follow the convention that the input data is sequential and has the label "data", and the target data is sparse and has the label "classes".
In the case of the HDF file above, the input data could be 100-dimensional MFCCs and the target data 5,000 word classes.
The parameter extern_data can be used to give an explicit definition of the shapes.
All constructor parameters to returnn.tf.util.data.Data can be provided as a dictionary for each data stream.
For the above example, extern_data could be defined as:
extern_data = {
'data': {'dim': 100, 'shape': (None, 100)},
'classes': {'dim': 5000, 'shape': (None,), 'sparse': True}
}
The None in the "shape" parameter tuple defines that the axis has a dynamic length.
For sequence tasks there is usually only one dynamic axis, which is specified to be the time axis.
In the case of multiple dynamic axes or spatial axes, it is helpful to define the time axis explicitly.
For the example case of two dynamic axes, the time axis could be set to be the first axis:
For the example case of two dynamic axes, the time axis could be set to be the first axis:
extern_data = {
'data': {'dim': 100, 'shape': (None, None, 100), 'time_dim_axis': 1},
[...]
}
Note that while the "shape" parameter tuple is always defined without the batch axis,
the axis labels for the time, feature or batch axis itself are counted including the batch axis.
This means that "time_dim_axis": 1 corresponds to the first None of the "shape" tuple.
For the general case (non-sparse data), only dim and shape are required; the other parameters are optional.
Using Layer Outputs as Data
In case you want to specify data by using layers, it is possible to add register_as_extern_data to the layer dictionary.
The provided string is the key to access the data.
It is not required to also add the key manually to the extern_data dictionary.
Using Multiple Data Inputs
For cases where a single dataset is not sufficient, it is possible to combine multiple datasets by using the MetaDataset.MetaDataset.
Details on how to use the MetaDataset can be found here.
Synchronizing Dynamic Axes
In the case that there are multiple data streams that have exactly the same length,
RETURNN does not automatically match those axes while broadcasting.
The dynamic axes of different data streams can be synchronized by using returnn.tf.util.data.DimensionTag.
dynamic_time_dimension = DimensionTag(name="dynamic_time")
extern_data = {
'data1': {'dim': 100, 'shape': (None, 100), 'time_dim_axis': 1, 'same_time_dim_as': {'T': dynamic_time_dimension}},
'data2': {'dim': 10, 'shape': (None, 10), 'time_dim_axis': 1, 'same_time_dim_as': {'T': dynamic_time_dimension}},
[...]
}
The parameter "same_time_dim_as" takes a dictionary with axis indices or axis labels (see Managing Axes)
as keys and the DimensionTag as value.
For the above example, there is no difference between using "T" or 1 as key.
Recurrent Sub-Networks
For many tasks it will be necessary to define multiple layers that are applied as a recurrent network over a sequential input,
especially when running a search over sequences.
While basic recurrent layers such as LSTM variants are defined by using the "rec" layer and selecting the desired
"unit", custom sub-networks can be defined by passing a network dictionary as the "unit" attribute.
The defined structure will then be applied for each position of the sequence.
As for the global network, an "output" layer is required to define which values will be the output of the subnet.
The layer outputs of the previous time steps can be accessed by adding the prefix "prev:" to the layer names.
Static data from outside the subnet can be accessed via the layer prefix "base:".
Example of a recurrent "relu" layer:
{
"class": "rec",
"from": ["input"],
"unit": {
# Recurrent subnet here, operate on a single timestep:
"output": {
"class": "linear",
"from": ["prev:output", "data:source"],
"activation": "relu",
"n_out": n_out},
},
"n_out": n_out,
}
Layers with recurrent dependencies and hidden states (e.g. LSTMs) can be added as an "rnn_cell" layer.
For available cell units see here.
The number of steps is determined by the time axis of the input.
If multiple inputs are given, they will be concatenated on the feature axis.
Currently there is no support to access the two layer outputs directly.
The concatenated data can be split by using a SliceLayer (returnn.tf.layers.basic.SliceLayer) on the feature axis.
Recurrent net with independent step count:
If the number of steps in the recurrent net should be determined by a condition rather than by the input length, an "end" layer can be defined inside the unit, as in the following example of an MLP-style attention mechanism with an LSTM layer:
{
"class": "rec",
"from": [],
"unit": {
"s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["output"], "n_out": 128},
"energy_in": {"class": "combine", "kind": "add", "from": ["base:enc_ctx", "s_transformed"], "n_out": 128},
"energy_tanh": {"class": "activation", "activation": "tanh", "from": ["energy_in"]},
"energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["energy_tanh"], "n_out": 128},
"att_weights": {"class": "softmax_over_spatial", "from": ["energy"]}, # (B, enc-T, H)
"att": {"class": "generic_attention", "weights": "att_weights", "base": "base:enc_value"}, # (B, H, V)
"decoder": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:att"], "n_out": 256, 'target': 'data'},
'stop_token': {'class': 'linear', 'activation': None, 'n_out': 1, 'loss': 'bin_ce',
'target': 'stop_token_target', 'from': ['output']},
'stop_token_sigmoid': {'class': 'activation', 'activation': 'sigmoid', 'from': ['stop_token']},
"end": {"class": "compare", "from": ["output"], "value": 0},
"output_prob": {"class": "softmax", "from": ["decoder"], "target": target, "loss": "ce"}
},
"n_out": n_out
}
The from
attribute can be empty when using the output as a target.
The sequence length will then be determined by this target.
Using Multiple Outputs
Besides the default "output" layer, additional layers can be flagged as output layers.
When adding the parameter is_output_layer and setting it to True,
the output of a sub-layer can be accessed from outside by using the pattern "recurrent_layer/sublayer".
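As a sketch, a sub-layer of a recurrent unit can be exposed like this (the layer names are hypothetical; only is_output_layer and the "layer/sublayer" access pattern are from the text above):

```python
# Hypothetical sketch: mark a sub-layer of the rec unit as an output layer
# so it can be referenced from outside as "decoder/att_weights".
network = {
    "decoder": {
        "class": "rec", "from": ["encoder"],
        "unit": {
            "att_weights": {"class": "softmax_over_spatial", "from": ["energy"],
                            "is_output_layer": True},  # accessible from outside
            # ...
            "output": {"class": "softmax", "from": ["prev:output"], "target": "classes"},
        },
    },
    # another layer can now refer to the sub-layer directly:
    "dump_att": {"class": "copy", "from": ["decoder/att_weights"]},
}
```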
RETURNN as Framework
Install RETURNN via pip (PyPI entry). Then import returnn should work.
See demo-returnn-as-framework.py as a full example.
Basically you can write very high-level code like this:
from returnn.TFEngine import Engine
from returnn.Dataset import init_dataset
from returnn.Config import get_global_config
config = get_global_config(auto_create=True)
config.update(dict(
# ...
))
engine = Engine(config)
train_data = init_dataset({"class": "Task12AXDataset", "num_seqs": 1000, "name": "train"})
dev_data = init_dataset({"class": "Task12AXDataset", "num_seqs": 100, "name": "dev", "fixed_random_seed": 1})
engine.init_train_from_config(train_data=train_data, dev_data=dev_data)
Or you can go to a lower level and construct the computation graph yourself:
import tensorflow as tf
from returnn.TFNetwork import TFNetwork
from returnn.Config import get_global_config
config = get_global_config(auto_create=True)
net = TFNetwork(train_flag=True)
net.construct_from_dict({
    # ...
})
fetches = net.get_fetches_dict()
with tf.compat.v1.Session() as session:
    results = session.run(fetches, feed_dict={
        # ...
        # you could use FeedDictDataProvider
    })
Or you can go even lower level and just use parts from TFUtil, TFNativeOp, etc.:
from returnn.TFNativeOp import ctc_loss
from returnn.TFNativeOp import edit_distance
from returnn.TFNativeOp import NativeLstm2
from returnn.TFUtil import ctc_greedy_decode
from returnn.TFUtil import get_available_gpu_min_compute_capability
from returnn.TFUtil import safe_log
from returnn.TFUtil import reuse_name_scope
from returnn.TFUtil import dimshuffle
# ...
Frequently Asked Questions
TensorFlow LSTM Benchmark
There are multiple LSTM implementations/kernels available in TensorFlow, and we also have our own kernel. In this benchmark, we try to compare the runtime performance during training for each of the kernels. We try to measure in a way that is generic and not specific to our Returnn framework. You can run this benchmark yourself with this script.
In Returnn with the TensorFlow backend, in the rec layer (TFNetworkRecLayer.RecLayer) you can use these LSTM kernels via the unit argument:
BasicLSTM (GPU and CPU). Uses tf.contrib.rnn.BasicLSTMCell via dynamic_rnn.
I.e. the cell itself is pure TensorFlow, and the loop over time is done via tf.while_loop.
StandardLSTM (GPU and CPU). Uses tf.contrib.rnn.LSTMCell via dynamic_rnn.
I.e. the cell itself is pure TensorFlow, and the loop over time is done via tf.while_loop. This has some more options compared to BasicLSTM.
LSTMBlock (GPU and CPU). Uses tf.contrib.rnn.LSTMBlockCell via dynamic_rnn.
The time-step operation is implemented as a single TF operation, and the loop over time is done via tf.while_loop. Thus this should be faster than BasicLSTM and StandardLSTM.
LSTMBlockFused (GPU and CPU). Uses tf.contrib.rnn.LSTMBlockFusedCell.
The loop over time is part of the op ("fused" in TF terminology), thus this is, like NativeLSTM and CudnnLSTM, a single op for the whole calculation. This is based on LSTMBlock and should thus be faster than LSTMBlock.
CudnnLSTM (GPU only). Uses the LSTM kernel from cuDNN, via tf.contrib.cudnn_rnn.CudnnLSTM.
The loop over time is done internally ("fused").
If you import such a model on CPU, it will automatically be converted into LSTMBlockFused.
NativeLSTM (GPU and CPU). Uses our own CUDA kernel, which can also be compiled for CPU.
The loop over time is also done via C++ code inside the op ("fused").
See TFNativeOp.
If you just use LSTM, it will currently use NativeLSTM by default.
Except for NativeLSTM, all of these kernels are part of the official TensorFlow framework.
Note that these kernels are always used for a single direction in time and a single layer.
The cuDNN LSTM kernel can also work bidirectionally and do multiple layers at once,
but tf.contrib.cudnn_rnn.CudnnLSTM currently does not support batches with sequences of different lengths,
thus this is normally not an option to use.
Note that most frameworks with cuDNN bindings do not support this correctly
(see here), where CNTK is currently the only exception.
In TensorFlow, this is issue 6633.
Note that you can still use the cuDNN kernel in the way we do in Returnn,
i.e. for a single layer in one time direction.
For the benchmark, we build a multi-layer bidirectional network. Example of a 3-layer bidirectional LSTM:
network = {
    "lstm1_fwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": 1},
    "lstm1_bwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": -1},
    "lstm2_fwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": 1, "from": ["lstm1_fwd", "lstm1_bwd"]},
    "lstm2_bwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": -1, "from": ["lstm1_fwd", "lstm1_bwd"]},
    "lstm3_fwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": 1, "from": ["lstm2_fwd", "lstm2_bwd"]},
    "lstm3_bwd": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": -1, "from": ["lstm2_fwd", "lstm2_bwd"]},
    "output": {"class": "softmax", "loss": "ce", "from": ["lstm3_fwd", "lstm3_bwd"]}
}
We use framewise cross-entropy as the loss for training,
and we use a very simple artificial dataset (GeneratingDataset.Task12AXDataset)
with dense input of a very low dimension (9)
and single output class indices (sparse) with a very low number of class labels (2),
so that the overhead of the final softmax layer, as well as the whole input pipeline, should be minimal.
We are not interested in the error performance on this task in this benchmark,
as in theory the results should all be the same.
In practice, they are not, due to different implementations,
and also because the initialization is currently not the same in all cases.
However, that has no effect on the runtime performance.
By default, we use chunking, i.e. we take not the full sequences but only fixed-size slices of them (50 frames). This reduces the amount of padding in a mini-batch, keeps the maximum sequence length of a batch fixed, and allows us to increase the number of sequences in a batch (40 sequences) for more parallelism. See our paper for more details about chunking. Thus, our mini-batch has in total 2000 frames.
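The chunking described above can be sketched as a plain function (a simplified illustration, not RETURNN's actual implementation):

```python
def chunk_sequence(seq_len, chunk_size=50, chunk_step=50):
    """Return (start, end) slices of a sequence for chunked batching.
    With chunk_step == chunk_size the chunks do not overlap."""
    chunks = []
    start = 0
    while start < seq_len:
        # the last chunk may be shorter than chunk_size
        chunks.append((start, min(start + chunk_size, seq_len)))
        start += chunk_step
    return chunks

# A sequence of 120 frames becomes chunks of at most 50 frames:
print(chunk_sequence(120))  # [(0, 50), (50, 100), (100, 120)]
```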
Comparison
For a 5 layer bidirectional LSTM with dimension 500 in each time direction, on a GeForce GTX 980, using 8 CPU threads, we got these results:
GPU:CudnnLSTM: 0:00:08.8151
GPU:NativeLSTM: 0:00:08.8440
GPU:LSTMBlockFused: 0:00:16.9765
GPU:LSTMBlock: 0:00:33.4895
GPU:StandardLSTM: 0:00:39.5170
GPU:BasicLSTM: 0:00:41.7282
CPU:NativeLSTM: 0:04:05.4365
CPU:LSTMBlockFused: 0:04:35.1702
CPU:StandardLSTM: 0:04:57.7977
CPU:BasicLSTM: 0:05:00.5334
CPU:LSTMBlock: 0:05:07.5613
On a GeForce GTX 1080 Ti, using 8 CPU threads, for the same experiment we got:
GPU:NativeLSTM: 0:00:05.2728
GPU:CudnnLSTM: 0:00:05.3645
GPU:LSTMBlockFused: 0:00:09.3915
GPU:LSTMBlock: 0:00:15.3071
GPU:StandardLSTM: 0:00:17.8279
GPU:BasicLSTM: 0:00:22.3976
CPU:NativeLSTM: 0:05:09.6268
CPU:LSTMBlockFused: 0:07:45.5984
CPU:StandardLSTM: 0:08:02.5465
CPU:BasicLSTM: 0:08:16.3543
CPU:LSTMBlock: 0:08:18.1589
And on a GeForce GTX 1070, with 4 CPU threads, we got:
GPU:NativeLSTM: 0:00:03.9989
GPU:CudnnLSTM: 0:00:05.4496
GPU:LSTMBlockFused: 0:00:07.5233
GPU:LSTMBlock: 0:00:11.1515
GPU:StandardLSTM: 0:00:12.0605
GPU:BasicLSTM: 0:00:12.0833
CPU:LSTMBlockFused: 0:02:53.6482
CPU:BasicLSTM: 0:03:00.8289
CPU:StandardLSTM: 0:03:01.6320
CPU:LSTMBlock: 0:03:04.8836
CPU:NativeLSTM: 0:03:18.5375
On a CPUonly system with a single CPU thread, we got:
CPU:NativeLSTM: 0:15:55.7625
CPU:LSTMBlockFused: 0:24:53.1451
CPU:BasicLSTM: 0:26:28.2804
CPU:StandardLSTM: 0:27:10.0493
CPU:LSTMBlock: 0:27:58.8870
Each of those was executed on different hardware, so there might be other small differences due to that.
Also the number of available CPU threads differs.
Each of those was run on Ubuntu 16.04 with TensorFlow 1.2 (installed via pip), CUDA 8.0 and cuDNN 5.1.
Analysis and Discussion
We are quite proud that our own LSTM kernel (NativeLSTM) has a similar runtime to the cuDNN LSTM kernel (CudnnLSTM), sometimes even better.
Its implementation is quite straightforward.
As expected, on GPU, both NativeLSTM and CudnnLSTM are faster than LSTMBlockFused (sometimes twice as fast).
Also as expected, on GPU, LSTMBlockFused is faster than LSTMBlock (up to 50%).
On GPU, LSTMBlock seems slightly faster than BasicLSTM/StandardLSTM, but the difference is not that big.
Interestingly, in all experiments on GPU, StandardLSTM seems to be slightly faster than BasicLSTM,
which is not expected, as BasicLSTM is simpler and also recommended by TensorFlow
if you don't need the extended options which are available for StandardLSTM.
On CPU, it again looks different, and not as clear.
This also depends on how many CPU threads are used, and on the hardware.
For example, NativeLSTM is currently not well optimized to use multiple threads (intra-op parallelism).
See also TFUtil.setup_tf_thread_pools() about intra- and inter-op parallelism.
We see that with a very low number of threads, on CPU, NativeLSTM can be the fastest, but not necessarily.
Increasing the number of threads, NativeLSTM can become the slowest.
On CPU, LSTMBlockFused seems to be the fastest apart from NativeLSTM, no matter the number of threads.
On CPU, interestingly, BasicLSTM and StandardLSTM seem to be slightly faster than LSTMBlock.
Configuration Parameters
Warning
The configuration reference is currently under construction and incomplete; for more options, look into the examples and the code itself.
General Settings
dev
    A dictionary specifying the development set. For details on datasets, see Datasets.
device
    E.g. gpu or cpu. Although RETURNN will automatically detect and use a GPU if available, a specific device can be enforced by setting this parameter.
extern_data (former num_outputs)
    Defines the source/target dimensions of the data. Both can be integers. extern_data can also be a dict if your dataset has other data streams. The standard source data is called "data" by default, and the standard target data is called "classes" by default. You can also specify whether your data is dense or sparse (i.e. it is just the index), which is specified by the number of dimensions, i.e. 2 (time-dim + feature-dim) or 1 (just time-dim). When using no explicit definition, it is assumed that the data contains a time axis.
    Example: extern_data = {"data": [100, 2], "classes": [5000, 1]}. This defines an input dimension of 100, where the input is dense (2), and an output dimension of 5000, where the output provided by the dataset is sparse (1).
    For a more explicit definition of the shapes, you can provide a dict instead of a list or tuple. This dict may contain information to create "Data" objects. For extern_data, only dim and shape are required. Example: 'feature_data': {'dim': 80, 'shape': (None, 80)} defines 80-dimensional features with a time axis of arbitrary length. Example: 'speaker_classes': {'dim': 1172, 'shape': (), 'sparse': True} defines a sparse input, e.g. for speaker classes that do not have a time axis.
    In general, all input parameters to returnn.tf.util.data.Data can be provided.
log
    Path to the log, or list of paths for multiple logs.
log_batch_size
    If set to True, for each batch the number of sequences and the maximal sequence length are displayed.
log_verbosity
    An integer or list of integers. Common values are 3 or 4. Starting with 5, you will get an output per mini-batch. If a list is provided for the logs, log_verbosity can be specified for each log.
model
    Defines the model file where RETURNN will save all model params after an epoch of training.
    For each epoch, it will suffix the filename by the epoch number.
    If load_from is not set, the model will also be loaded from this path.
network
    This is a nested dict which defines the network topology.
    It consists of layer names as strings, mapped to dicts, which define the layers.
    The layer dict consists of keys as strings, and the value type depends on the key.
    The layer dict should contain the key class, which defines the class or type of the layer, such as linear for a feed-forward layer, rec for a recurrent layer (including LSTM) or softmax for the output layer (which doesn't need to have the softmax activation). Usually it also contains the key n_out, which defines the feature dimension of the output of this layer, and the key from, which defines the inputs to this layer as a list of other layers. For details, see Layers / Modules.
num_inputs
    Input feature dimension of the network, related to the "data" tag.
    Deprecated for the TensorFlow backend, see extern_data.
num_outputs
    Output feature dimension of the network, related to the "classes" tag.
    Deprecated for the TensorFlow backend, see extern_data.
task
    The task to run. Common cases are train, forward or search.
tf_log_memory_usage
    If set to True, will display the current GPU memory usage when using the TensorFlow backend.
tf_log_dir
    Defines the folder where the tensorflow/tensorboard logs are written. Per default, the logs are written next to the models. Note: has to be set specifically when loading a model from a folder without write permission.
train
    A dictionary specifying the training dataset. For details on datasets, see Datasets.
use_tensorflow
    If you set this to True, TensorFlow will be used.
Training
batch_size
    An integer defining the batch size in data items (frames, words, subwords, etc.) per batch.
    A mini-batch has at least a time dimension and a batch dimension (or sequence dimension),
    and depending on dense or sparse, also a feature dimension.
    batch_size is the upper limit for time * sequences during creation of the mini-batches.
batching
    Defines the default value for seq_ordering across all datasets. It is recommended to not use this parameter, but rather to define seq_ordering explicitly in the datasets for better readability. Possible values are:
    default: Keep the sequences as is
    reverse: Use the default sequences in reversed order
    random: Shuffle the data with a predefined fixed seed
    random:<seed>: Shuffle the data with the given seed
    sorted: Sort by length (only if available), beginning with the shortest sequences
    sorted_reverse: Sort by length, beginning with the longest sequences
    laplace:<n_buckets>: Sort by length with n laplacian buckets (one bucket means going from shortest to longest and back with 1/n of the data)
    laplace:.<n_sequences>: Sort by length with n sequences per laplacian bucket
    Note that not all sequence order modes are available for all datasets, and some datasets may provide additional modes.
chunking
    You can chunk the sequences of your data into parts, which will greatly reduce the amount of needed zero-padding.
    This option is a string of two numbers separated by a colon, i.e. chunk_size:chunk_step, where chunk_size is the size of a chunk, and chunk_step is the step after which we create the next chunk. I.e. the chunks will overlap by chunk_size - chunk_step frames. Set this to 0 to disable it, or for example 100:75 to enable it.
cleanup_old_models
    If set to True, checkpoints are removed based on their score on the dev set. Per default, the 2 most recent, the 4 best, and the checkpoints of epochs 20, 40, 80, 160, 240 are kept. Can be set as a dictionary to specify additional options:
    keep_last_n: integer defining how many recent checkpoints to keep
    keep_best_n: integer defining how many best checkpoints to keep
    keep: list or set of integers defining which checkpoints to keep
max_seq_length
    A dict with string:integer pairs. The string must be a valid data key,
    and the integer specifies the upper bound for this data object.
    During batch construction, any sequence where the specified data object exceeds the upper bound is discarded.
    Note that some datasets (e.g. OggZipDataset) load and process the data to determine the length, so even for discarded sequences data processing might be performed.
max_seqs
    An integer specifying the upper limit of sequences in a batch (can be used in addition to batch_size).
num_epochs
    An integer specifying the number of epochs to train.
save_interval
    An integer specifying after how many epochs the model is saved.
start_epoch
    An integer or string specifying the epoch to start the training at. The default is "auto".
stop_on_nonfinite_train_score
    If set to False, the training will not be interrupted if a single update step has a loss with NaN or Inf.
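A sketch of a typical training block in a config combining the parameters above (all values are illustrative, not recommendations):

```python
# Illustrative training settings; tune the values for your task.
use_tensorflow = True
task = "train"
batch_size = 5000    # max data items per mini-batch (time * sequences)
max_seqs = 40        # max sequences per mini-batch
chunking = "100:75"  # chunks of 100 frames, shifted by 75 (25 frames overlap)
num_epochs = 80
save_interval = 1
cleanup_old_models = {"keep_best_n": 4, "keep_last_n": 2}
```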
Optimizer Settings
Note
To define the update algorithm, set the parameter optimizer to a dictionary and define the type by setting class.
All available optimizers and their parameters can be found here.
The learning rate should not be set in the dict, but rather separately.
If no updater is specified, plain SGD is used.
The learning rate control scheme is set with learning_rate_control,
and many possible settings are available for the different control schemes.
For the default values, have a look at LearningRateControl.py.
Warning
RETURNN will override the optimizer epsilon with 1e-16 if not specified otherwise; this can lead to unwanted behaviour when assuming a default epsilon of e.g. 1e-8 for Adam.
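For example, a config fragment setting Adam with an explicit epsilon could look like this (a sketch; the exact set of accepted keys depends on the optimizer class):

```python
# Optimizer class plus its constructor arguments; "epsilon" is set
# explicitly because RETURNN would otherwise default it to 1e-16.
optimizer = {"class": "adam", "epsilon": 1e-8}

# The learning rate is set separately, not inside the optimizer dict:
learning_rate = 0.001
learning_rate_control = "newbob_multi_epoch"
```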
accum_grad_multiple_step
    An integer specifying over how many update steps the gradient is accumulated ("gradient accumulation").
gradient_clip
    Specify a gradient clipping threshold.
gradient_noise
    Apply Gaussian noise with the given deviation to the gradient.
learning_rate
    Specifies the global learning rate.
learning_rates
    A list of learning rates that defines the learning rate for each epoch from the beginning. Can be used for learning-rate warm-up.
learning_rate_control
    This defines which type of learning rate control mechanism is used. Possible values are:
    constant: a constant learning rate which is never modified
    newbob_abs: scheduling based on absolute improvement
    newbob_rel: scheduling based on relative improvement
    newbob_multi_epoch: scheduling based on relative improvement averaged over multiple epochs
    Please also look at the settings with the newbob prefix for further customization.
learning_rate_control_error_measure
    A str to define which score or error is used to control the learning rate reduction. Per default, Returnn will use dev_score_output. A typical choice would be dev_score_LAYERNAME or dev_error_LAYERNAME. Can be set to None to disable learning rate control.
learning_rate_control_min_num_epochs_per_new_lr
    The number of epochs after the last update during which the learning rate is kept constant.
learning_rate_control_relative_error_relative_lr
    If true, the relative error is scaled with the ratio of the default learning rate divided by the current learning rate. Can be used with newbob_rel and newbob_multi_epoch.
learning_rate_file
    A path to a file storing the learning rate for each epoch. Despite the name, it also stores scores and errors.
min_learning_rate
    Specifies the minimum learning rate.
newbob_error_threshold
    This is the absolute improvement that has to be achieved in order to _not_ reduce the learning rate.
    Can be used with newbob_abs. The value can be positive or negative.
newbob_learning_rate_decay
    The scaling factor for the learning rate when a reduction is applied.
    This parameter is available for all newbob variants.
newbob_multi_num_epochs
    The number of epochs the improvement is averaged over.
newbob_multi_update_interval
    The number of steps after which the learning rate is updated.
    This is set equal to newbob_multi_num_epochs when not specified.
newbob_relative_error_threshold
    This is the relative improvement that has to be achieved in order to _not_ reduce the learning rate.
    Can be used with newbob_rel and newbob_multi_epoch. The value can be positive or negative.
optimizer
    A dictionary with a class entry for the optimizer. Other keys are passed as parameters to the constructor of the optimizer class.
relative_error_div_by_old
    If true, the relative error is computed by dividing the error difference by the old error value instead of the current error value.
reset_updater_vars_mod_step
    The number of epochs after which the internal states of all optimizers will be reset to their initial state.
use_learning_rate_control_always
    If true, use the learning rate control scheme also during pretraining.
Pretraining
The parameter to enable pretraining is called pretrain and needs to be set to a dictionary. The dictionary usually contains:
construction_algo
    This needs to be a function with the signature (idx, net_dict) -> net_dict which returns the transformed network structure for pretraining step "idx".
repetitions
    This number defines how many epochs should be run during each pretraining step. If not specified, this will be 1.
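A minimal sketch of such a construction function (the layer names lstm1..lstm3 are hypothetical; the scheme grows the network by one LSTM layer per pretrain step, and returning None is assumed to signal that pretraining is finished):

```python
def construction_algo(idx, net_dict):
    """Pretrain scheme: keep only the first idx+1 LSTM layers."""
    num_layers = 3  # total LSTM layers in the full network (hypothetical)
    if idx + 1 >= num_layers:
        return None  # assumed to signal: no further pretrain steps
    # remove the layers that should not exist yet in this pretrain step
    for i in range(idx + 2, num_layers + 1):
        del net_dict["lstm%i" % i]
    # reconnect the output layer to the last remaining LSTM layer
    net_dict["output"]["from"] = ["lstm%i" % (idx + 1)]
    return net_dict

pretrain = {"construction_algo": construction_algo, "repetitions": 2}
```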
Model Loading
Note
This documentation does not cover all possible combinations of parameters for loading models. For more details, please refer to EngineBase.
import_model_train_epoch1
    If a path to a valid model is provided (for TF models, paths with or without .meta or .index extension are possible), use this to initialize the weights for training. If you do not want to start a new training, see load.
load
    If a path to a valid model is provided (for TF models, paths with or without .meta or .index extension are possible), use this to load the specified model and training state. The training is continued from the last position.
load_epoch
    Specifies the epoch index, and selects the checkpoint based on the prefix given in model. If not set, RETURNN will determine the epoch from the filename, or use the latest epoch in case of providing only model.
preload_from_files
    A dictionary that should contain filename and prefix. For all layers in your network whose layer name starts with prefix, it will load the parameters from the checkpoint specified by filename (it will look for the corresponding layer name without the prefix in the given checkpoint). Example (without using a specific prefix):
    preload_from_files = {
        "existing-model": {
            "init_for_train": True,
            "ignore_missing": True,  # if the checkpoint only partly covers your model
            "filename": ".../net-model/network.163",  # your checkpoint file
        }
    }
Generation and Search
Note
There is no beam_size parameter for the config, as beam_size is a parameter of the choice layer.
For further details, see TFNetworkRecLayer.ChoiceLayer.
forward_override_hdf_output
    Per default, Returnn will give an error when trying to overwrite an existing output. If this flag is set to true, the check is disabled.
output_file
    When the task is "forward", specifies the output path for the resulting hdf. If not specified, the name will be "dump-fwd-epoch-%i.hdf" % epoch.
search_output_layer
    TODO...
search_output_file
    Defines where the search output is written to.
search_output_file_format
    The supported file formats are txt and py.
Debugging
debug_add_check_numerics_on_output
    If set to True, should assert for inf and nan.
debug_grad_summaries
    If set to True, adds additional information about the gradients to the TensorBoard.
debug_print_layer_output_template
    If set to True, print the layer template information during network construction.
debug_print_layer_output_shape
    If set to True, print the layer shape information while the graph is executed.
debug_objective_loss_summaries
    If set to True, adds the objective loss values (normalization only when activated, including loss scaling) to the TensorBoard.
debug_unnormalized_loss_summaries
    If set to True, adds the unnormalized loss values to the TensorBoard.
Also see Debugging.
Datasets
All datasets in RETURNN are based on Dataset.Dataset,
and most are also based on either CachedDataset.CachedDataset or CachedDataset2.CachedDataset2.
The common parameters that can be used across most datasets are:
partition_epoch: split the data into smaller parts per epoch
seq_ordering: define the sequence ordering of the data.
Possible values for the sequence ordering are:
default: Keep the sequences as is
reverse: Use the default sequences in reversed order
random: Shuffle the data with a predefined fixed seed
random:<seed>: Shuffle the data with the given seed
sorted: Sort by length (only if available), beginning with the shortest sequences
sorted_reverse: Sort by length, beginning with the longest sequences
laplace:<n_buckets>: Shuffle the data and sort by length within each of n bins; every second bin is sorted in reverse
laplace:.<n_sequences>: As above, but the number of bins is chosen such that each bin contains roughly n sequences
laplace:<n_buckets>:<seed>: A seed can be provided for both laplace versions, separated by an additional colon
Note that not all sequence order modes are available for all datasets,
and some datasets may provide additional modes.
For details on the different sequence orderings, have a look at Dataset.Dataset.get_seq_order_for_epoch().
Also check the sequence ordering possibilities with the MetaDataset.
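As a sketch, these common parameters appear directly in the dataset dict of a config, e.g. (the HDF file names are hypothetical):

```python
# Hypothetical training dataset using the common parameters above.
train = {
    "class": "HDFDataset",
    "files": ["train-1.hdf", "train-2.hdf"],  # hypothetical file names
    "partition_epoch": 5,             # one epoch covers 1/5 of the data
    "seq_ordering": "laplace:.1000",  # ~1000 sequences per laplacian bucket
}
```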
Generic Datasets
HDF Dataset
class HDFDataset.HDFDataset(files=None, use_cache_manager=False, **kwargs)
Bases: returnn.datasets.cached.CachedDataset
Dataset based on HDF files. This was the main original dataset format of RETURNN.
Parameters:
    files (None|list[str])
    use_cache_manager (bool): uses Util.cf() for files
Next-Gen HDF Dataset
class HDFDataset.NextGenHDFDataset(input_stream_name, files=None, **kwargs)
Bases: returnn.datasets.cached2.CachedDataset2
Another separate dataset which uses HDF files to store the data.
Parameters:
    input_stream_name (str)
    files (None|list[str])
Text Datasets
Language Model Dataset
class LmDataset.LmDataset(corpus_file, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)
Bases: returnn.datasets.cached2.CachedDataset2
Dataset useful for language modeling. It creates index sequences for either words, characters or other orthographic symbols based on a vocabulary. Can also perform internal word-to-phoneme conversion with a lexicon file. Reads simple txt files or Bliss XML files (also gzipped).
To use the LmDataset with words or characters, either orth_symbols_file or orth_symbols_map_file has to be specified (both is not possible). If words should be used, set word_based to True.
The LmDataset also supports the conversion of words to phonemes with the help of the LmDataset.PhoneSeqGenerator class. To enable this mode, the input parameters to LmDataset.PhoneSeqGenerator have to be provided as a dict in phone_info. As a lexicon file has to be specified in this dict, orth_symbols_file and orth_symbols_map_file are not used in this case.
The LmDataset does not work without providing a vocabulary in any of the above-mentioned ways.
After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols, and self.orth_symbols_map gives the corresponding mapping from symbol to integer index (in case phone_info is not set).
Parameters:
    corpus_file (str|()->str|list[str]|()->list[str]): Bliss XML or line-based txt, optionally gzipped.
    orth_symbols_file (str|()->str|None): a text file containing a list of orthography symbols
    orth_symbols_map_file (str|()->str|None): either a list of orth symbols, each line: "<symbol> <index>", or a pickled dictionary
    orth_replace_map_file (str|()->str|None): JSON file with replacement dict for orth symbols.
    word_based (bool): whether to parse single words; otherwise character-based.
    word_end_symbol (str|None): if provided and if word_based is False (character-based modeling), token to be used to represent word ends.
    seq_end_symbol (str|None): what to add at the end, if given. Will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
    unknown_symbol (str|None): token to represent unknown words.
    parse_orth_opts (dict[str]|None): kwargs for parse_orthography().
    phone_info (dict|None): a dict containing parameters, including a lexicon file, for LmDataset.PhoneSeqGenerator.
    add_random_phone_seqs (int): will add random seqs with the same len as the real seq as additional data.
    log_auto_replace_unknown_symbols (bool|int): write about auto-replacements with the unknown symbol. If this is an int, it will only log the first N replacements, and then keep quiet.
    log_skipped_seqs (bool|int): write about skipped seqs to logging, due to a missing lexicon entry or so. If this is an int, it will only log the first N entries, and then keep quiet.
    error_on_invalid_seq (bool): if there is a seq we would have to skip, error.
    add_delayed_seq_data (bool): will add another data key "delayed" which will have the sequence delayed_seq_data_start_symbol + original_sequence[:-1].
    delayed_seq_data_start_symbol (str): used for add_delayed_seq_data.
Translation Dataset
class LmDataset.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, search_without_reference=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)
Bases: returnn.datasets.cached2.CachedDataset2
Based on the conventions by our team for translation datasets. It gets a directory and expects these files:
    source.dev(.gz)
    source.train(.gz)
    source.vocab.pkl
    target.dev(.gz)
    target.train(.gz)
    target.vocab.pkl
The convention is to use "dev" and "train" as file_postfix for the dev and train set respectively, but any file_postfix can be used. The target file and vocabulary do not have to exist when setting source_only. It is also automatically checked whether a gzip version of the file exists.
To follow the RETURNN conventions on data input and output, the source text is mapped to the "data" key, and the target text to the "classes" data key. Both are index sequences.
Parameters:
    path (str): the directory containing the files
    file_postfix (str): e.g. "train" or "dev". It will then search for "source." + postfix and "target." + postfix.
    random_shuffle_epoch1 (bool): if True, will also randomly shuffle epoch 1. See self.init_seq_order().
    source_postfix (str): will concat this at the end of the source.
    target_postfix (str): will concat this at the end of the target. You might want to add some sentence-end symbol.
    source_only (bool): if targets are not available
    search_without_reference (bool)
    unknown_label (str|dict[str,str]|None): label to replace out-of-vocabulary words with, e.g. "<UNK>". If not given, will not replace unknowns but throw an error. Can also be a dict data_key -> unknown_label to configure each data key separately (default for each key is None).
    seq_list_file (str): filename. Line-separated list of line numbers defining a fixed sequence order. Multiple occurrences are supported, which allows for repeating examples while loading them only once.
    use_cache_manager (bool): uses Util.cf() for files
Audio Datasets
Extern Sprint Dataset
class SprintDataset.ExternSprintDataset(sprintTrainerExecPath, sprintConfigStr, partitionEpoch=None, **kwargs)
Bases: returnn.datasets.sprint.SprintDatasetBase
This is a Dataset which you can use directly in RETURNN. You can use it to get any type of data from Sprint (RWTH ASR toolkit), e.g. you can use Sprint to do feature extraction and preprocessing.
This class is like SprintDatasetBase, except that we will start an external Sprint instance ourselves which will forward the data to us over a pipe. The Sprint subprocess will use SprintExternInterface to communicate with us.
Parameters:
    sprintTrainerExecPath (str|list[str])
    sprintConfigStr (str|list[str]|()->str|list[()->str]|()->list[str]|()->list[()->str]): via eval_shell_str
    partitionEpoch (int|None): deprecated, use partition_epoch instead
Ogg Zip Dataset

class GeneratingDataset.OggZipDataset(path, audio, targets, targets_post_process=None, use_cache_manager=False, segment_file=None, zip_audio_files_have_name_as_prefix=True, fixed_random_seed=None, fixed_random_subset=None, epoch_wise_filter=None, **kwargs)[source]
Bases: returnn.datasets.cached2.CachedDataset2
Generic dataset which reads a zip file containing an Ogg file for each sequence and a text document. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. Wav files are also supported, and other file formats readable by the "soundfile" library might work as well (not tested). By setting audio or targets to None, the dataset can be used in text-only or audio-only mode.
The content of the zip file is:
 a .txt file with the same name as the zip file, containing a Python list of dictionaries
 a subfolder with the same name as the zip file, containing the audio files
The dictionaries in the .txt file must have the following structure:
[{'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'}, ...]
Each dict can optionally also have the entry 'seq_name': 'arbitrary_sequence_name'. If seq_name is not included, the seq_tag will be the name of the file. duration is mandatory, as this information is needed for sequence sorting; however, it does not have to match the real duration in any way.
Parameters:
 path (str|list[str]) – filename of the zip
 audio (dict[str]|None) – options for ExtractAudioFeatures. Use {} for defaults. None disables audio.
 targets (dict[str]|None) – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)
 targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth
 use_cache_manager (bool) – uses Util.cf()
 segment_file (str|None) – .txt or .gz text file containing sequence tags that will be used as whitelist
 zip_audio_files_have_name_as_prefix (bool) –
 fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering="random". Otherwise the epoch will be used.
 fixed_random_subset (float|int|None) – value in [0,1] to specify the fraction, or integer >=1 to specify the number of seqs. If given, will use this random subset. This is applied initially at loading time, i.e. it does not depend on the epoch. It uses an internally hardcoded fixed random seed, i.e. it is deterministic.
 epoch_wise_filter (dict|None) – see init_seq_order
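The options above can be combined into a dataset entry of a RETURNN config. A minimal sketch, assuming hypothetical file names ("my_data.zip", "bpe.codes", "bpe.vocab") and MFCC features:

```python
# Hypothetical OggZipDataset entry for a RETURNN config.
# The zip path and the BPE code/vocab file paths are placeholders.
train = {
    "class": "OggZipDataset",
    "path": "my_data.zip",
    # MFCC features; {} would select all defaults of ExtractAudioFeatures
    "audio": {"features": "mfcc", "num_feature_filters": 40},
    # targets via a BPE vocabulary
    "targets": {
        "class": "BytePairEncoding",
        "bpe_file": "bpe.codes",
        "vocab_file": "bpe.vocab",
    },
    "seq_ordering": "laplace:.1000",
    "partition_epoch": 4,
}
```

seq_ordering and partition_epoch are generic Dataset options (see the Dataset base class below), not specific to OggZipDataset.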
Dataset Combination
Meta Dataset
The MetaDataset is to be used in the case of multimodality. Here, the datasets are expected to describe different features of the same training sequences. These features will all be available to the network at the same time.
The datasets to be combined are given via the input parameter "datasets".
To define which training examples from the different datasets belong together, a "seq_list_file" in pickle format has to be created. It contains a list of sequence tags for each dataset (see example below). Note that in general each dataset type has its own tag format, e.g. for the TranslationDataset it is line-<n>, for the SprintDataset it is <corpus name>/<recording>/<segment id>.
Providing a sequence list can be omitted if the set of sequence tags is the same for all datasets. When using multiple ExternSprintDataset instances, the Sprint segment file can be provided as sequence list. In this case, the MetaDataset assumes that sequences with equal tags correspond to each other. This e.g. works when combining TranslationDatasets if all the text files are sentence-aligned.
Example of Sequence List:
{ 'sprint': [
    'corpus/ted_1/1',
    'corpus/ted_1/2',
    'corpus/ted_1/3',
    'corpus/ted_1/4'],
  'translation': [
    'line-0',
    'line-1',
    'line-2',
    'line-3']
}
This is a Python dict stored in a pickle file. E.g. the sequence tagged 'corpus/ted_1/3' in the 'sprint' dataset corresponds to the sequence tagged 'line-2' in the 'translation' dataset.
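Such a sequence list can be written with the standard pickle module. A minimal sketch, assuming the tag formats shown above:

```python
import pickle

# Sketch: write the MetaDataset sequence list as a pickle file.
# The tags must follow each dataset's own tag format, e.g.
# "<corpus name>/<recording>/<segment id>" for Sprint, "line-<n>" for
# the TranslationDataset.
seq_list = {
    "sprint": ["corpus/ted_1/%i" % i for i in range(1, 5)],
    "translation": ["line-%i" % i for i in range(4)],
}
with open("seq_list.pkl", "wb") as f:
    pickle.dump(seq_list, f)
```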
Example of MetaDataset config:
train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "target_text_sprint": ("sprint", "orth_classes"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_ordering": "random",
         "partition_epoch": 2,
}
This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called "data".
Sequence Sorting:
If the selected sequence order uses the length of the data (e.g. when using "sorted" or any kind of "laplace"), a sub-dataset has to be specified via seq_order_control_dataset. The desired sorting needs to be set as a parameter in this sub-dataset; setting seq_ordering for the MetaDataset itself will be ignored.
Combined Dataset
The CombinedDataset is to be used in the cases of multi-task learning and the combination of corpora. Here, in general, the datasets describe different training sequences. For each sequence, only the features of the corresponding dataset will be available. Features of the other datasets are set to empty arrays.
The input parameter "datasets" is the same as for the MetaDataset. The "data_map" is reversed to allow several datasets to map to the same feature.
The "default" "seq_ordering" is to first go through all sequences of the first dataset, then the second and so on. All other sequence orderings ("random", "sorted", "laplace", ...) are supported and based on this "default" ordering. There is a special sequence ordering "random_dataset", where we pick datasets at random, while keeping the sequence order within each dataset as is.
To adjust the ratio of training examples from the different datasets within an epoch, one can use "repeat_epoch" in some of the datasets to increase their size relative to the others. Also, "partition_epoch" in some of the datasets can be used to shrink them relative to the others.
Example of CombinedDataset config:
train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "default",
         "partition_epoch": 2,
}
This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called "data".
Note: The mapping has been inverted. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data.
 For each sequence idx, it will select one of the given datasets, fill in the data keys of this dataset, and return empty sequences for the remaining datasets.
 The default sequence ordering is to first go through all sequences of dataset 1, then dataset 2 and so on.
 If seq_ordering is set to "random_dataset", we always pick one of the datasets at random (equally distributed over the sum of num-seqs), but still go through the sequences of a particular dataset in the order defined for it in the config (in order if not defined).
 For "sorted" or "laplace", the sequence length as provided by the datasets is used to sort all sequences jointly. Note that this overrides the sequence order of the sub-datasets (this is also the case for "random").
 "partition_epoch" of the CombinedDataset is applied to the joint sequence order over all sequences. "partition_epoch" of the sub-datasets is still applied. This can be used to adjust the relative size of the datasets. (However, do not combine "partition_epoch" on both levels, as this leads to an unexpected selection of sequences.) To upscale a dataset, rather than downscaling the others via "partition_epoch", use the "repeat_epoch" option.
Also see MetaDataset.

class Dataset.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)[source]
Bases: object
Base class for any dataset. This defines the dataset API.
Parameters:
 name (str) – e.g. "train" or "eval"
 window (int) – features will be of dimension window * feature_dim, as we add a context window around them. Not all datasets support this option.
 context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
 chunking (None|str|int|(int,int)|dict|(dict,dict)) – "chunk_size:chunk_step"
 seq_ordering (str) – the "batching" option in the config, e.g. "default", "sorted" or "random". See self.get_seq_order_for_epoch() for more details.
 random_seed_offset (int|None) –
 partition_epoch (int|None) –
 repeat_epoch (int|None) – repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed in combination with partition_epoch.
 seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
 unique_seq_tags (bool) – uniquify seqs with the same seq tags in the seq order
 seq_order_seq_lens_file (str|None) – for the seq order, use the seq lengths given by this file
 shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
 estimated_num_seqs (None|int) – for progress reporting, in case the real num_seqs is unknown

class CachedDataset.CachedDataset(cache_byte_size=0, **kwargs)[source]
Bases: returnn.datasets.basic.Dataset
Parameters: cache_byte_size (int) –

class CachedDataset2.CachedDataset2(**kwargs)[source]
Bases: returnn.datasets.basic.Dataset
Somewhat like CachedDataset, but different: simpler in some sense, and more generic. Caching might be worse.
If you derive from this class:
 you must override _collect_single_seq
 you must set num_inputs (dense dim of the "data" key) and num_outputs (dict key -> dim, ndim-1)
 you should set labels
 handle seq ordering by overriding init_seq_order
 you can set _estimated_num_seqs
 you can set _num_seqs or _num_timesteps if you know them in advance
Layers / Modules
Basic Layers
Accumulate Mean Layer

class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]
Accumulates the mean of the input (in training), over the batch dim and time dim by default. It is similar to ReduceLayer.
Parameters:
 exp_average (float) – momentum in the exponential average calculation
 axes (int|list[str]|str) – the axes to reduce. Must contain batch and time.
 initial_value (float) – how to initialize the variable which accumulates the mean
 is_prob_distribution (bool) – if provided, a better default for initial_value
Activation Layer

class returnn.tf.layers.basic.ActivationLayer(activation, **kwargs)[source]
This layer just applies an activation function. See TFUtil.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.
Parameters: activation (str) – e.g. "relu", "tanh", etc.
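A minimal sketch of how this layer might appear in a network dict (the layer names "ff", "ff_act" and the dimensions are illustrative, not from the original):

```python
# Hypothetical network fragment: a linear layer without activation,
# followed by an explicit activation layer applying ReLU.
network = {
    "ff": {"class": "linear", "from": "data", "n_out": 256, "activation": None},
    "ff_act": {"class": "activation", "from": "ff", "activation": "relu"},
    "output": {"class": "softmax", "from": "ff_act", "loss": "ce"},
}
```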
Combine Layer

class returnn.tf.layers.basic.CombineLayer(kind, sources, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]
Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.
Its basic behavior is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.
Parameters:
 kind (str) – currently accepted values are "average", "add", "sub", "mul", or "eval"
 sources (list[LayerBase]) –
 activation (str|None) – if provided, an activation function to apply, e.g. "tanh" or "relu"
 with_bias (bool) – if given, will add a trainable bias tensor
 eval (str|callable) – for kind="eval", will eval this string, or call this function. See _op_kind_eval().
 eval_locals (dict[str]|None) – locals for eval
 eval_for_output_loss (bool) – will do the same eval on layer.output_loss
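A minimal sketch of both modes in a network dict (the source layer names "layer_a", "layer_b" are placeholders; source(i) is the conventional accessor inside an eval string):

```python
# Hypothetical network fragment: element-wise sum of two layers, and a
# custom "eval"-kind combination expressed as a string over the sources.
network_fragment = {
    "sum": {"class": "combine", "kind": "add", "from": ["layer_a", "layer_b"]},
    "scaled_diff": {
        "class": "combine", "kind": "eval",
        "from": ["layer_a", "layer_b"],
        "eval": "0.5 * (source(0) - source(1))",
    },
}
```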
Compare Layer

class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, **kwargs)[source]
Compares element-wise the tokens of all input sequences among themselves and/or with a specified given value. The comparisons are performed in a chain according to the order in which they are listed.
Example:
{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}
computes i1 < i2 < val, and it is true only if the whole chain of operations is true. The final result is the logical "and" of all comparisons. Note that value is the last element to be compared to.
A common example usage is the "end" layer in a rec subnetwork to specify the stopping criterion, e.g. that the last generated token is equal to the end-of-sentence token:
"output": {"class": "rec", "from": [], "unit": {
    ...
    "end": {"class": "compare", "from": "output", "value": end_of_sentence_id}
}, "target": "classes0"}
Parameters:
 kind (str) – which comparison operation to use, e.g. "equal", "greater", "less" or other supported TF comparison ops
 value (float|int|None) – if specified, will also compare to this
Constant Layer

class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, dtype=None, with_batch_dim=False, **kwargs)[source]
Output is a constant value.
Parameters:
 sources (list[LayerBase]) –
 value (int|float|bool) –
 dtype (str|None) –
 with_batch_dim (bool) –

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Convolution Layer

class returnn.tf.layers.basic.ConvLayer(n_out, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, auto_use_channel_first=False, with_bias=False, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, **kwargs)[source]
A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate "pool" layer.
Parameters:
 n_out (int) – number of outgoing features
 filter_size (tuple[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. The input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. It will automatically swap the batch dim to the first axis of the input data.
 padding (str) – "same" or "valid"
 strides (int|tuple[int]) – strides for the spatial dims, i.e. the length of this tuple should be the same as filter_size, or a single int
 dilation_rate (int|tuple[int]) – dilation for the spatial dims
 input_expand_dims (int) – number of dynamic dims to add to the input
 input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature dim as a spatial dim
 auto_use_channel_first (bool) – convert the input to NCHW or not
 input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value
 with_bias (bool) – if True, will add a bias to the output features
 activation (None|str) – if set, will apply this function at the end
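A minimal sketch of a convolution followed by the separate pooling layer mentioned above (layer names and dimensions are illustrative):

```python
# Hypothetical network fragment: 2D convolution over the two spatial dims,
# followed by max pooling for downsampling in a separate "pool" layer.
network_fragment = {
    "conv0": {"class": "conv", "from": "data", "n_out": 32,
              "filter_size": (3, 3), "padding": "same",
              "with_bias": True, "activation": "relu"},
    "pool0": {"class": "pool", "from": "conv0",
              "mode": "max", "pool_size": (2, 2), "padding": "same"},
}
```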

classmethod calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1)[source]
Parameters:
 in_dim (int|tf.Tensor|T) – dimension in some axis
 filter_size (int) – e.g. 2, for the corresponding axis
 stride (int) – e.g. 1, for the corresponding axis
 dilation_rate (int) – e.g. 1
 padding (str) – "valid" or "same"
Returns: the output dimension
Return type: T
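The arithmetic behind calc_out_dim follows the standard convolution output-size formulas. A plain-Python sketch for the integer case (assumed here, no tf.Tensor support):

```python
def calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1):
    """Sketch of the standard conv output-size formula for one axis,
    assuming plain integer inputs."""
    padding = padding.lower()
    if padding == "same":
        # with "same" padding, only the stride matters: ceil(in_dim / stride)
        return -(-in_dim // stride)
    elif padding == "valid":
        # effective filter size with dilation
        effective_filter = (filter_size - 1) * dilation_rate + 1
        # ceil((in_dim - effective_filter + 1) / stride)
        return -(-(in_dim - effective_filter + 1) // stride)
    raise ValueError("invalid padding %r" % padding)

calc_out_dim(10, filter_size=3, stride=1, padding="valid")  # -> 8
calc_out_dim(10, filter_size=3, stride=2, padding="same")   # -> 5
```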
Copy Layer

class returnn.tf.layers.basic.CopyLayer(extra_deps=(), **kwargs)[source]
This layer does nothing; it copies its input. If multiple sources are provided, they are concatenated in the feature dim.
Parameters: extra_deps (list[LayerBase]) – just adds these as additional dependencies, without really using them. This can have an effect on the search beam though, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency.

classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, n_out=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Cumulative Sum Layer

class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]
Basically wraps tf.cumsum. Also supports this inside the RecLayer.
Parameters:
 axis (str) – see Data.get_axis_from_description()
 additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string
 reverse (bool) –
Dot Layer

class returnn.tf.layers.basic.DotLayer(red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, debug=False, **kwargs)[source]
This performs a dot product of two sources. The underlying matmul expects shapes (shared..., I, J) * (shared..., J, K) -> (shared..., I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don't need to exist. All other axes (shared...) are expected to match.
Parameters:
 red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
 red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
 var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
 var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
 add_var2_if_empty (bool) – if var2=None, add dim=1 at the end
 debug (bool) – will print debug shapes, etc.

classmethod get_out_data_from_opts(name, sources, red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, **kwargs)[source]
Parameters:
 name (str) –
 sources (list[LayerBase]) –
 red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
 red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
 var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
 var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
 add_var2_if_empty (bool) –
Return type:
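A minimal sketch of an attention-style use of the dot product, assuming hypothetical layers "enc" of shape (B, T, D) and "query" of shape (B, D):

```python
# Hypothetical network fragment: attention energies via a dot product.
# The shared feature dim "F" is reduced; the encoder time axis "T" is the
# var-dim of source 1; the query has no var-dim (var2=None).
network_fragment = {
    "energy": {"class": "dot", "from": ["enc", "query"],
               "red1": "F", "red2": "F",
               "var1": "T", "var2": None,
               "add_var2_if_empty": False},  # result shape: (B, T)
}
```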
Elementwise Product Layer

class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]
Element-wise product in some axes. Microsoft calls this "static attention", in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.
Parameters:
 axes (str|list[str]) – e.g. "spatial", but all those axes must be of fixed dimension
 size (tuple[int]) – for double-checking, you can explicitly provide the size
Gating Layer

class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', **kwargs)[source]
Splits the output into two equal parts, applies the gate_activation (sigmoid by default) on one part and some other activation (e.g. tanh) on the other part, and then multiplies them element-wise. Thus, the output dimension is input-dimension / 2.
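The computation can be sketched in plain Python for a single feature vector; this is a sketch under the assumption that the first half goes through the activation and the second half through the gate:

```python
import math

def gating(x, activation=math.tanh):
    """Sketch of the gating computation for one feature vector.
    Assumption: first half -> activation, second half -> sigmoid gate."""
    assert len(x) % 2 == 0
    half = len(x) // 2
    a, b = x[:half], x[half:]
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    # element-wise product of the two halves; output dim is len(x) / 2
    return [activation(ai) * sigmoid(bi) for ai, bi in zip(a, b)]
```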
Linear Layer

class returnn.tf.layers.basic.LinearLayer(activation, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)[source]
Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature dimension of the input, with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.
Parameters:
 activation (str|None) – e.g. "relu", or None
 with_bias (bool) –
 grad_filter (float|None) – if the grad norm is higher than this threshold (before the activation), the grad is removed
 forward_weights_init (str) – see TFUtil.get_initializer()
 recurrent_weights_init (str) – see TFUtil.get_initializer()
 bias_init (str|float) – see TFUtil.get_initializer()
 use_transposed_weights (bool) – if True, defines the weight matrix with transposed dimensions (n_out, n_in)
Pooling Layer

class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, use_channel_first=False, **kwargs)[source]
A generic N-D pooling layer. This would usually be done after a convolution, for downsampling.
Parameters:
 mode (str) – "max" or "avg"
 pool_size (tuple[int]) – shape of the window of each reduce
 padding (str) – "valid" or "same"
 dilation_rate (tuple[int]|int) –
 strides (tuple[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size
 use_channel_first (bool) – if set, will transform the input to NCHW format

classmethod get_out_data_from_opts(name, pool_size, strides=None, dilation_rate=1, sources=(), padding='VALID', use_channel_first=False, **kwargs)[source]
Parameters:
 name (str) –
 pool_size (tuple[int]|list[int]) –
 strides (tuple[int]|list[int]|int) –
 dilation_rate (int|tuple[int]|list[int]) –
 sources (list[LayerBase]) –
 padding (str) –
 use_channel_first (bool) –
Return type:
Reduce Layer

class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]
This reduces some axis by using "sum" or "max". It's basically a wrapper around tf.reduce_sum or tf.reduce_max.
Parameters:
 mode (str) – "sum" or "max", "argmin", "min", "argmax", "mean", "logsumexp"
 axes (int|list[int]|str) – one axis or multiple axes to reduce. Accepts the special tokens "B"|"batch", "spatial", "spatial_except_time", or "F"|"feature"; it is strongly recommended to use these symbolic names. See Data.get_axes_from_description().
 axis (int|list[int]|str) – for compatibility, can be used instead of axes
 keep_dims (bool) – whether the reduced dimensions should be kept (with size 1)
 enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis, e.g. 0 will convert the input into batch-major format if it is not already. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
 use_time_mask (bool) – if we reduce over the time-dim axis, use the seq-len info. By default, in that case, this is True.
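A minimal sketch using the symbolic axis names recommended above (the source layer name "encoder" is a placeholder):

```python
# Hypothetical network fragment: masked mean over the time axis, e.g. to
# obtain one fixed-size embedding per sequence from an encoder output.
network_fragment = {
    "seq_embedding": {"class": "reduce", "from": "encoder",
                      "mode": "mean", "axes": "T",
                      "use_time_mask": True},  # respect per-seq lengths
}
```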

classmethod need_enforce_batch_dim_axis(axes)[source]
Parameters: axes (int|list[int]|str) –
Returns: whether any integer is in axes, in which case we should have a fixed dimension layout
Return type: bool

classmethod get_axes(axis, input_data)[source]
Parameters:
 axis – see self.__init__()
 input_data (Data) –
Returns: list of axes
Return type: list[int]

classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]
Parameters:
 name (str) –
 sources (list[LayerBase]) –
 mode (str) – (default '' here because other code uses this function)
 axes (str|list[str]|None) –
 axis (str|None) –
 keep_dims (bool) –
 enforce_batch_dim_axis (int|None) –
Return type:
ReduceOut Layer

class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, **kwargs)[source]
Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.
Parameters:
 mode (str) – "sum" or "max" or "mean"
 num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.
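A minimal sketch of the maxout use case mentioned above (layer names and dimensions are illustrative):

```python
# Hypothetical network fragment: maxout with 2 pieces. The linear layer
# produces 512 features, which "reduce_out" with mode "max" reduces to
# 512 // 2 = 256 output features.
network_fragment = {
    "ff": {"class": "linear", "from": "data", "n_out": 512, "activation": None},
    "maxout": {"class": "reduce_out", "from": "ff",
               "mode": "max", "num_pieces": 2},
}
```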
Switch Layer

class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]
Wrapper around tf.where() (or more generically TFUtil.where_bc()), or statically chooses a single source if the condition is a callable (...)->bool. (tf.cond is not useful here, as the sources would already have been constructed and computed.) See also CondLayer.
Parameters:
 condition (LayerBase|bool) – if callable, expected to be (...)->bool, and called in transform_config_dict
 true_from (LayerBase|None) –
 false_from (LayerBase|None) –

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Variable Layer

class returnn.tf.layers.basic.VariableLayer(shape, dtype='float32', add_batch_axis=True, add_time_axis=False, trainable=True, init=0, **kwargs)[source]
Represents a variable. Can add a batch/time dimension if wanted. Can be trainable. See the defaults.
Parameters:
 shape (tuple[int]|list[int]) –
 dtype (str) –
 add_batch_axis (bool) –
 add_time_axis (bool) –
 trainable (bool) –
 init (str|float|int) – see TFUtil.get_initializer()

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Weighted Sum Layer

class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]
Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. A similar effect can be achieved with ElemwiseProdLayer and ReduceLayer, or just a DotLayer with a VariableLayer. See also LinearLayer.
Parameters:
 axes (str|list[str]) – the axes to do the weighted sum over
 padding (str) – "valid" or "same", in case of keep_dims=True
 size (None|tuple[int]) – the kernel size. If left away, the axes must be of fixed dimension, and we will use keep_dims=False and padding="valid" by default. Otherwise, if given, you must also provide padding, and keep_dims=True is the default.
 keep_dims (bool) – if False, the axes will be squeezed away. See also size.
Window Layer

class returnn.tf.layers.basic.WindowLayer(window_size, window_left=None, window_right=None, axis='T', padding='same', **kwargs)[source]
Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. Will always return in batch-major mode. E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge (window_size, dim) together into (window_size * dim,), you can use the MergeDimsLayer, e.g. {"class": "merge_dims", "axes": "except_time"}.
This is not to take out a window from the time dimension; see SliceLayer or SliceNdLayer for that.
Parameters:
 window_size (int) –
 window_left (int|None) –
 window_right (int|None) –
 axis (str|int) – see Data.get_axis_from_description()
 padding (str) – "same" or "valid"
 kwargs –
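A minimal sketch of the windowing plus flattening combination described above (layer names are illustrative):

```python
# Hypothetical network fragment: sliding window of 5 frames over time,
# then merging (window_size, dim) into a single feature dim via
# MergeDimsLayer, as the WindowLayer description suggests.
network_fragment = {
    "win": {"class": "window", "from": "data",
            "window_size": 5, "padding": "same"},
    "win_flat": {"class": "merge_dims", "from": "win",
                 "axes": "except_time"},
}
```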

classmethod get_out_data_from_opts(name, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
 name (str) –
 sources (list[LayerBase]) –
 window_size (int) –
 axis (str) –
Return type:

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
 batch_dim (tf.Tensor) –
 rec_layer (TFNetworkRecLayer.RecLayer|LayerBase) –
 window_size (int) –
 axis (str) –
 sources (list[LayerBase]) –
Return type: dict[str,tf.Tensor]
Shape and Type Modification
Cast Layer
Expand Dimensions Layer

class returnn.tf.layers.basic.ExpandDimsLayer(axis, dim=1, **kwargs)[source]
Adds some axis.
Parameters:
 axis (str|int) – axis to add, e.g. "F"|"feature" or "spatial"|"time"|"T". If this is an integer, the input data is first converted into batch-major mode, and then this is counted with batch-dim.
 dim (int) – dimension of the new axis (1 by default)
Merge Dimensions Layer

class returnn.tf.layers.basic.MergeDimsLayer(axes, n_out=None, **kwargs)[source]
Merges a list of axes into a single one. (Flattens the dims.) E.g. if the input is (batch, width, height, dim) and axes=(1,2), then we get (batch, width*height, dim). Or if the input is (batch, time, height, dim) and axes="except_time", then we get (batch, time, height*dim). See also CombineDimsLayer. When batch and time got merged, SplitBatchTimeLayer can undo this.
Parameters:
 axes (str|list[str]|list[int]) – see Data.get_axes_from_description(), e.g. "except_time"
 n_out (int|None) –

classmethod get_out_data_from_opts(name, axes, sources=(), n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, **kwargs)[source]
Parameters:
 name (str) –
 axes (str|list[str]) –
 sources (list[LayerBase]) –
 n_out (int|None|NotSpecified) –
 out_type (None|dict[str]) –
Return type:
Length Layer
Pad Layer

class returnn.tf.layers.basic.PadLayer(axes, padding, value=0, mode='constant', **kwargs)[source]
Adds (e.g. zero) padding in some axis or axes.
Parameters:
 axes (str|list[str]) – e.g. "F" etc. See Data.get_axes_from_description().
 padding (list[(int,int)]|(int,int)|int) – how much to pad left/right in each axis
 value (int|float) – what constant value to pad with, for mode=="constant"
 mode (str) – "constant", "reflect" or "symmetric"
Postfix (in Time) Layer

class returnn.tf.layers.basic.PostfixInTimeLayer(postfix=0.0, repeat=1, **kwargs)[source]
Adds some postfix in the time dimension.
Parameters:
 postfix (float|int|LayerBase) – constant, or another layer without time axis, to use as postfix
 repeat (int) – how often to repeat the postfix

classmethod get_out_data_from_opts(name, sources, postfix=0.0, **kwargs)[source]
Parameters:
 name (str) –
 sources (list[LayerBase]) –
 postfix (float|int|LayerBase) – constant, or another layer without time axis, to use as postfix
Return type:

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –
Prefix (in Time) Layer

class returnn.tf.layers.basic.PrefixInTimeLayer(prefix=0.0, repeat=1, size_base=None, **kwargs)[source]
Adds some prefix in the time dimension. This is kind of the reverse of what SliceNdLayer does.
Parameters:
 prefix (float|str) – either some constant or another layer
 repeat (int|LayerBase) – how often to repeat the prefix
 size_base (LayerBase|None) – copy seq-lens from here

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
 d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Resize Layer

class returnn.tf.layers.basic.ResizeLayer(factor, axis, kind='nn', fill_value=None, fill_dropout=None, **kwargs)[source]
Resizes the input, i.e. upsampling or downsampling. Supports different kinds, such as linear interpolation or nearest-neighbor.
Parameters:
 factor (int) –
 axis (str|int) – the axis to resize, counted with batch-dim. Can also be "T" for time.
 kind (str) – "linear", "nn"/"nearest_neighbor", "cubic", "fill"
 fill_value (None|int|float) – if kind=="fill"
 fill_dropout (float) – if set, will dropout in the same axis
Reinterpret Data LayerÂ¶

class
returnn.tf.layers.basic.
ReinterpretDataLayer
(switch_axes=None, size_base=None, set_axes=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶ Acts like the
CopyLayer
but reinterprets the role of some axes or data.Parameters:  switch_axes (str|list[str]) – e.g. "bt" to switch batch and time axes
 size_base (LayerBase|None) – copy the size_placeholder from the given layer
 set_axes (dict[str,int|str]) – the key is "B","T","F", value is via
Data.get_axis_from_description()
 enforce_batch_major (bool) –
 enforce_time_major (bool) –
 set_sparse (bool|None) – if bool, set sparse value to this
 set_sparse_dim (int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
 increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –

classmethod
get_out_data_from_opts
(name, sources, switch_axes=None, size_base=None, set_axes=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶ Parameters:  name (str) –
 sources (list[LayerBase]) –
 switch_axes (str|list[str]) – e.g. "bt" to switch batch and time axes
 size_base (LayerBase|None) – similar as size_target
 set_axes (dict[str,int]) –
 enforce_batch_major (bool) –
 enforce_time_major (bool) –
 set_sparse (bool|None) – if bool, set sparse value to this
 set_sparse_dim (int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
 increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse
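A config sketch for a common use of this layer (the source name "labels" and the dim value are assumptions):

```python
# Hypothetical network fragment: mark the output of "labels" as sparse
# with a given dim, without changing the underlying tensor.
network = {
    "relabeled": {
        "class": "reinterpret_data",  # ReinterpretDataLayer
        "from": "labels",
        "set_sparse": True,
        "set_sparse_dim": 1000,
    },
}
```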
Scatter n-dim Layer¶

class
returnn.tf.layers.basic.
ScatterNdLayer
(position, position_axis, output_dim_via_time_from, filter_invalid_indices=False, **kwargs)[source]¶ The inverse of
GatherNdLayer
. Mostly a wrapper for tf.scatter_nd.
The input to the layer are the updates, the indices are via the position argument. The indices are into the newly constructed output dimension. The output shape is constructed via the common shape of the input and the position, and the unique common axis (if not unique, we would need to introduce an option to specify it) is replaced by the given output dimension (currently via output_dim_via_time_from).
Examples:
position (indices): (B,eTs)
input (updates): (eTs,D) or (B,eTs,D) -> expanded to (B,eTs,D)
output shape: (B,eT,D)

position (indices): (B,dT,eTs)
input (updates): (eTs,D) -> expanded to (B,dT,eTs,D)
output shape: (B,dT,eT,D)

position (indices): (dT,eTs)
input (updates): (eTs,D) -> expanded to (dT,eTs,D)
output shape: (dT,eT,D)

position (indices): (dT,eTs)
input (updates): (B,eTs,D) -> expanded to (dT,eTs,B,D)
output shape: (dT,eT,B,D)

In all these examples, output_dim_via_time_from is (B,eT,F), and eTs gets replaced by eT.
Parameters:  position (LayerBase) – indices into first axis (excluding batch) of the output
 position_axis (str|int) – axis in position to replace by the output-dim
 output_dim_via_time_from (LayerBase) – use the time-dim from this layer as the output-dim
 filter_invalid_indices (bool) – allow for indices <0 or >= output_dim, which will be discarded in the output

classmethod
get_out_data_from_opts
(name, sources, position, position_axis, output_dim_via_time_from, **kwargs)[source]¶ Parameters: Return type:

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) –
ShiftAxisLayer¶

class
returnn.tf.layers.basic.
ShiftAxisLayer
(axis, amount, pad=True, adjust_size_info=True, **kwargs)[source]¶ Shifts the dimensions in an axis around. This layer may change the axis-dimension.
This name might be confusing. No axis will be shifted here. See
SwapAxesLayer
for that.Parameters:  axis (str|int) – single axis to shift
 amount (int) – number of elements to shift (<0 for left-shift, >0 for right-shift)
 pad (bool) – preserve shape by padding
 adjust_size_info (bool) – whether to adjust the size_placeholder
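The shift-with-padding semantics can be sketched on a plain Python list (a simplified illustration, not RETURNN's actual implementation):

```python
def shift_axis(seq, amount, pad_value=0):
    """Mimic ShiftAxisLayer with pad=True on a plain list:
    amount < 0 shifts left, amount > 0 shifts right; the shape is
    preserved by padding with pad_value."""
    if amount > 0:
        return [pad_value] * amount + seq[:-amount]
    if amount < 0:
        return seq[-amount:] + [pad_value] * (-amount)
    return list(seq)
```

For example, `shift_axis([1, 2, 3, 4], 1)` gives `[0, 1, 2, 3]`, and `shift_axis([1, 2, 3, 4], -1)` gives `[2, 3, 4, 0]`.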
Slice Layer¶

class
returnn.tf.layers.basic.
SliceLayer
(axis, slice_start=None, slice_end=None, slice_step=None, **kwargs)[source]¶ Slicing on the input, i.e. x[start:end:step] in some axis. See also
SliceNdLayer
.Parameters:  axis (int|str) –
 axis_kind (str|None) – "T" for time, "B" for batch, "F" for feature
 slice_start (int|None) –
 slice_end (int|None) –
 slice_step (int|None) –
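A minimal config sketch (the source "data" and the subsampling use case are assumptions):

```python
# Hypothetical network fragment: keep every second time frame,
# i.e. x[::2] along the time axis.
network = {
    "subsampled": {
        "class": "slice",  # SliceLayer
        "from": "data",
        "axis": "T",
        "slice_step": 2,
    },
}
```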
Slice n-dim Layer¶

class
returnn.tf.layers.basic.
SliceNdLayer
(start, size, min_size=None, **kwargs)[source]¶ This takes out a slice-range from some axis, e.g.
x[start:start + size]
. This layer allows a different start slice point for each batch, in contrast to
SliceLayer
, and the start is variable. See also
GatherNdLayer
.
PrefixInTimeLayer
can recover the original shape (by zero-padding).
Parameters:  start (LayerBase) –
 size (int|None) – if None, it uses the max possible size, and it becomes a dynamic axis
 min_size (int|None) – if size is None, but we want to have a min-size, set this
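A config sketch showing the per-batch start point (all layer names are assumptions; a constant start layer stands in for a real per-batch position layer):

```python
# Hypothetical network fragment: for each batch entry, take a window of
# 10 frames from "encoder", beginning at a per-batch start position.
network = {
    "window_start": {"class": "constant", "value": 0},  # stand-in for a real start layer
    "window": {
        "class": "slice_nd",  # SliceNdLayer
        "from": "encoder",
        "start": "window_start",
        "size": 10,
    },
}
```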

classmethod
get_out_data_from_opts
(name, sources=(), start=None, size=None, **kwargs)[source]Â¶ Parameters:  name (str) â
 sources (list[LayerBase]) â
 start (LayerBaseNone) â
 size (intNone) â
Return type:

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –
Split Batch Time Layer¶

class
returnn.tf.layers.basic.
SplitBatchTimeLayer
(base, **kwargs)[source]¶ A very specific layer which expects to get input of shape (batch * time, …) and converts it into (batch, time, …), where it recovers the seq-lens from some other layer. See
SplitDimsLayer
for a more generic layer.
Parameters: base (LayerBase) – used to recover the seq-lens
classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –

Split Dimensions Layer¶

class
returnn.tf.layers.basic.
SplitDimsLayer
(axis, dims, **kwargs)[source]¶ Splits one axis into multiple axes. E.g. if you know that your feature-dim is composed by a window, i.e. the input is (batch, time, window * feature), you can set axis="F", dims=(window, -1), and you will get the output (batch, time, window, feature). Also see
SplitBatchTimeLayer
.
Parameters:  axis (str) – e.g. "F"
 dims (tuple[int]) – what the axis should be split into. e.g. (window, -1)
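The windowed-feature example from the description, as a config sketch (the source name "windowed" is an assumption):

```python
# Hypothetical network fragment: the feature axis holds window * feature
# values; split it into separate (window, feature) axes.
window = 5
network = {
    "split": {
        "class": "split_dims",  # SplitDimsLayer
        "from": "windowed",     # assumed source with shape (B, T, window * F)
        "axis": "F",
        "dims": (window, -1),   # -1: infer the remaining feature dim
    },
}
```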
Squeeze Layer¶

class
returnn.tf.layers.basic.
SqueezeLayer
(axis, enforce_batch_dim_axis=None, allow_no_op=False, **kwargs)[source]¶ Removes an axis with dimension 1. This is basically a wrapper around tf.squeeze.
Parameters:  axis (int|list[int]|str) – one axis or multiple axes to squeeze. this is counted with batch-dim, which by default is axis 0 (see enforce_batch_dim_axis). it also accepts the special tokens "B"/"batch", "spatial", "spatial_except_time", or "F"/"feature"
 enforce_batch_dim_axis (int|None) –
 allow_no_op (bool) –
Stack Layer¶
Swap Axes Layer¶

class
returnn.tf.layers.basic.
SwapAxesLayer
(axis1, axis2, **kwargs)[source]¶ Swaps two axes. Basically a wrapper around
TFUtil.swapaxes()
. See also
ReinterpretDataLayer
.
Parameters:  axis1 (int|str) –
 axis2 (int|str) –
Time Chunking Layer¶
Time UnChunking Layer¶

class
returnn.tf.layers.basic.
TimeUnChunkingLayer
(chunking_layer, **kwargs)[source]¶ Performs chunking in time. See
TFNativeOp.chunk()
.
Parameters: chunking_layer (TimeChunkingLayer) –
classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –

Recurrent Layers¶
Choice Layer¶

class
returnn.tf.layers.rec.
ChoiceLayer
(beam_size, keep_beams=False, search=<class 'returnn.util.basic.NotSpecified'>, input_type='prob', prob_scale=1.0, base_beam_score_scale=1.0, random_sample_scale=0.0, length_normalization=True, custom_score_combine=None, source_beam_sizes=None, scheduled_sampling=False, cheating=False, explicit_search_sources=None, **kwargs)[source]¶ This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.
We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.
In case of multiple sources, this layer computes the top-k combinations of choices. The score of such a combination is determined by adding up the (log-space) scores of the choices for the individual sources. In this case, the "target" parameter of the layer has to be set to a list of targets corresponding to the sources respectively. Because computing all possible combinations of source scores is costly, the sources are pruned beforehand using the beam sizes set by the "source_beam_sizes" parameter. The choices made for the different sources can be accessed via the sub-layers "<choice layer name>/out_0", "<choice layer name>/out_1" and so on. Note that the way scores are combined assumes the sources to be independent. If you want to model a dependency, use separate ChoiceLayers and let the input of one depend on the output of the other.
Parameters:  beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)
 keep_beams (bool) – specifies that we keep the beam_in entries, i.e. we just expand, i.e. we just search on the dim. beam_size must be a multiple of beam_in.
 search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.
 input_type (str) – "prob" or "log_prob", whether the input is in probability space, log-space, etc., or "regression", if it is a prediction of the data as-is. If there are several inputs, the same format for all is assumed.
 prob_scale (float) – factor for prob (score in +log space from source)
 base_beam_score_scale (float) – factor for beam base score (i.e. prev prob scores)
 random_sample_scale (float) – if >0, will add Gumbel scores. you might want to set base_beam_score_scale=0
 length_normalization (bool) – evaluates score_t/len in search
 source_beam_sizes (list[int]|None) – If there are several sources, they are pruned with these beam sizes before combination. If None, "beam_size" is used for all sources. Has to have the same length as the number of sources.
 scheduled_sampling (dict|None) –
 cheating (bool|str) – if True, will always add the true target in the beam. if "exclusive", enables cheating_exclusive. see
TFUtil.beam_search()
.
 explicit_search_sources (list[LayerBase]|None) – will mark it as an additional dependency. You might use these also in custom_score_combine.
 custom_score_combine (callable|None) –
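A sketch of how this layer is typically used inside a rec-layer subnetwork (the layer name "readout" and the target name "classes" are assumptions):

```python
# Hypothetical fragment of a rec-layer subnetwork ("unit" dict):
# beam search with beam size 12 over the softmax output; with search
# disabled, "output" returns the ground-truth labels of target "classes".
rec_unit = {
    "output_prob": {"class": "softmax", "from": "readout", "target": "classes"},
    "output": {
        "class": "choice",  # ChoiceLayer
        "from": "output_prob",
        "input_type": "prob",
        "beam_size": 12,
        "target": "classes",
    },
}
```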

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer

classmethod
get_out_data_from_opts
(name, sources, target, network, beam_size, search=<class 'returnn.util.basic.NotSpecified'>, scheduled_sampling=False, cheating=False, **kwargs)[source]¶ Parameters:  name (str) –
 sources (list[LayerBase]) –
 target (str) –
 network (returnn.tf.network.TFNetwork) –
 beam_size (int) –
 search (NotSpecified|bool) –
 scheduled_sampling (dict|bool) –
 cheating (bool) –
Return type:

get_sub_layer
(layer_name)[source]¶ Used to get outputs in case of multiple targets. For all targets we create a sub-layer that can be referred to as "self.name + '/out_' + index" (e.g. output/out_0). These sub-layers can then be used as input to other layers, e.g. "output_0": {"class": "copy", "from": ["output/out_0"]}.
Parameters: layer_name (str) – name of the sub_layer (e.g. "out_0") Returns: internal layer that outputs labels for the target corresponding to layer_name Return type: InternalLayer

classmethod
get_sub_layer_out_data_from_opts
(layer_name, parent_layer_kwargs)[source]¶ Parameters:  layer_name (str) – name of the sub_layer (e.g. "out_0"), see self.get_sub_layer()
 parent_layer_kwargs (dict[str]) – kwargs for the parent layer
Returns: Data template, network and the class type of the sub-layer
Return type:
Decision Layer¶

class
returnn.tf.layers.rec.
DecideLayer
(length_normalization=False, **kwargs)[source]¶ This is kind of the counterpart to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim) where the beam with the highest score is selected. Thus, this will do a decision based on the scores. It will convert the data to batch-major mode.
Parameters: length_normalization (bool) – performed on the beam scores
classmethod
cls_get_search_beam_size
(network=None, **kwargs)[source]¶ Parameters: network (returnn.tf.network.TFNetwork) – Return type: int|None

classmethod
decide
(src, output=None, owner=None, name=None, length_normalization=False)[source]¶ Parameters:  src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
 output (Data|None) –
 owner (LayerBase|None) –
 name (str|None) –
 length_normalization (bool) – performed on the beam scores
Returns: best beam selected from input, e.g. shape (batch, time, dim)
Return type: (Data, SearchChoices|None)

classmethod
get_out_data_from_opts
(name, sources, network, **kwargs)[source]¶ Parameters:  name (str) –
 sources (list[LayerBase]) –
 network (returnn.tf.network.TFNetwork) –
Return type:

Get Accumulated Output Layer¶

class
returnn.tf.layers.rec.
GetRecAccumulatedOutputLayer
(sub_layer, **kwargs)[source]¶ For
RecLayer
with a subnet. If some layer is explicitly marked as an additional output layer (via "is_output_layer": True), you can get that sub-net layer output via this accessor. Retrieves the accumulated output.
Note that this functionality is obsolete now. You can simply access such a sub-layer via the generic sub-layer access mechanism. I.e. instead of:
"sub_layer": {"class": "get_rec_accumulated", "from": "rec_layer", "sub_layer": "hidden"}
You can do:
"sub_layer": {"class": "copy", "from": "rec_layer/hidden"}
Parameters: sub_layer (str) – layer of subnet in RecLayer source, which has "is_output_layer": True
Positional Encoding Layer¶

class
returnn.tf.layers.rec.
PositionalEncodingLayer
(add_to_input=False, constant=-1, offset=None, **kwargs)[source]¶ Provides positional encoding in the form of (batch, time, n_out) or (time, batch, n_out) where n_out is the number of channels, if it is run outside a
RecLayer
, and (batch, n_out) or (n_out, batch) if run inside a
RecLayer
, where it will depend on the current time frame.
Assumes one source input with a time dimension if outside a
RecLayer
. With add_to_input, it will calculate x + input, and the output shape is the same as the input.
The positional encoding is the same as in Tensor2Tensor. See
TFUtil.get_positional_encoding()
.
Parameters:  add_to_input (bool) – will add the signal to the input
 constant (int) – if positive, always output the corresponding positional encoding.
 offset (None|LayerBase) – Specify the offset to be added to positions. Expect shape (batch, time) or (batch,).
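The Tensor2Tensor-style sinusoidal encoding that this layer provides can be sketched in plain Python (a simplified per-position computation for illustration, not RETURNN's actual implementation):

```python
import math

def positional_encoding(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """One row per position; first half sine, second half cosine,
    following the Tensor2Tensor timing-signal construction."""
    num_timescales = channels // 2
    log_inc = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = [min_timescale * math.exp(-i * log_inc) for i in range(num_timescales)]
    encoding = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        encoding.append([math.sin(x) for x in scaled] + [math.cos(x) for x in scaled])
    return encoding
```

At position 0, the sine half is all zeros and the cosine half all ones, which is a quick sanity check for the construction.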

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) –

classmethod
get_out_data_from_opts
(name, network, add_to_input=False, sources=(), **kwargs)[source]¶ Parameters:  name (str) –
 network (returnn.tf.network.TFNetwork) –
 add_to_input (bool) –
 sources (list[LayerBase]) –
Return type:
Recurrent Layer¶

class
returnn.tf.layers.rec.
RecLayer
(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, back_prop=None, use_global_rec_step_offset=False, include_eos=False, debug=None, **kwargs)[source]¶ Recurrent layer, has support for several implementations of LSTMs (via
unit
argument), see TensorFlow LSTM Benchmark (http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), and also GRU, or simple RNN. Via the unit parameter, you specify the operation/model performed in the recurrence. It can be a string and specify an RNN cell, where all TF cells can be used, and the "Cell" suffix can be omitted; case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):
 BasicLSTM (the cell), via official TF, pure TF implementation
 LSTMBlock (the cell), via tf.contrib.rnn.
 LSTMBlockFused, via tf.contrib.rnn. should be much faster than BasicLSTM
 CudnnLSTM, via tf.contrib.cudnn_rnn. This is experimental yet.
 NativeLSTM, our own native LSTM. should be faster than LSTMBlockFused.
 NativeLstm2, improved own native LSTM, should be the fastest and most powerful.
We default to the current tested fastest one, i.e. NativeLSTM. Note that they are currently not compatible with each other, i.e. in the way the parameters are represented.
A subnetwork can also be given, which will be evaluated step-by-step, which can use attention over some separate input, and which can be used to implement a decoder in a sequence-to-sequence scenario. The subnetwork will get the extern data from the parent net as templates, and if there is input to the RecLayer, then it will be available as the "source" data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access outputs from layers from the previous time step when they are referred to with the "prev:" prefix.
Example:
{
    "class": "rec",
    "from": ["input"],
    "unit": {
        # Recurrent subnet here, operating on a single time-step:
        "output": {
            "class": "linear",
            "from": ["prev:output", "data:source"],
            "activation": "relu",
            "n_out": n_out},
    },
    "n_out": n_out,
}
More examples can be seen in
test_TFNetworkRecLayer
and test_TFEngine.
The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible. It will try to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that those layers behave the same with time-dimension or without time-dimension and used per-step. Examples for such layers are
LinearLayer
,
RnnCellLayer
or
SelfAttentionLayer
with option attention_left_only.
This layer can also be inside another RecLayer. In that case, it behaves similar to
RnnCellLayer
. (This support is somewhat incomplete yet. It should work for the native units such as NativeLstm.)
Parameters:  unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. "nativelstm". see comment below. alternatively a whole subnetwork, which will be executed step by step, and which can include "prev" in addition to "from" to refer to previous steps.
 unit_opts (None|dict[str]) – passed to RNNCell creation
 direction (int|None) – None|1 -> forward, -1 -> backward
 input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim
 initial_state (LayerBase|str|float|int|tuple|None) –
 max_seq_len (int|tf.Tensor|None) – if unit is a subnetwork. str will be evaluated. see code
 forward_weights_init (str) – see
TFUtil.get_initializer()
 recurrent_weights_init (str) – see
TFUtil.get_initializer()
 bias_init (str) – see
TFUtil.get_initializer()
 optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible
 cheating (bool) – Unused, is now part of ChoiceLayer
 unroll (bool) – if possible, unroll the loop (implementation detail)
 back_prop (bool|None) – for tf.while_loop. the default will use self.network.train_flag
 use_global_rec_step_offset (bool) –
 include_eos (bool) – for search, whether we should include the frame where "end" is True
 debug (bool|None) –

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ This method transforms the templates in the config dictionary into references of the layer instances (and creates them in the process). :param dict[str] d: will modify inplace :param returnn.tf.network.TFNetwork network: :param ((str) -> LayerBase) get_layer: function to get or construct another layer

classmethod
get_out_data_from_opts
(network, unit, sources=(), initial_state=None, **kwargs)[source]¶ Parameters:  network (returnn.tf.network.TFNetwork) –
 unit (str|dict[str]) –
 sources (list[LayerBase]) –
 initial_state (str|LayerBase|list[str|LayerBase]) –
Return type:

classmethod
get_rec_initial_extra_outputs
(**kwargs)[source]¶ Return type: dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod
get_rnn_cell_class
(name, cell_only=False)[source]¶ Parameters:  name (str) – cell name, minus the "Cell" at the end
 cell_only (bool) – i.e. for single-step execution
Return type: type[rnn_cell.RNNCell]|type[returnn.tf.native_op.RecSeqCellOp]

classmethod
get_losses
(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]¶ Parameters:  name (str) – layer name
 network (returnn.tf.network.TFNetwork) –
 loss (Loss|None) – argument just as for __init__
 output (Data) – the output (template) for the layer
 reduce_func (((tf.Tensor)->tf.Tensor)|None) –
 layer (LayerBase|None) –
 kwargs – other layer kwargs
Return type:

static
convert_cudnn_canonical_to_lstm_block
(reader, prefix, target='lstm_block_wrapper/')[source]¶ This assumes CudnnLSTM currently, with num_layers=1, input_mode="linear_input", direction="unidirectional"!
Parameters:  reader (tf.train.CheckpointReader) –
 prefix (str) – e.g. "layer2/rec/"
 target (str) – e.g. "lstm_block_wrapper/" or "rnn/lstm_cell/"
Returns: dict key -> value, {".../kernel": ..., ".../bias": ...} with prefix
Return type: dict[str,numpy.ndarray]
Parameters: key (str|int|None) – Return type: tf.Tensor
RNN Cell Layer¶

class
returnn.tf.layers.rec.
RnnCellLayer
(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]¶ Wrapper around tf.contrib.rnn.RNNCell. This will operate a single step, i.e. there is no time dimension, i.e. we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer. (But it can also handle the case to be optimized out of the rec loop, i.e. outside a RecLayer, with a time dimension.)
Parameters:  n_out (int) – so far, only output shape (batch,n_out) supported
 unit (str|tf.contrib.rnn.RNNCell) – e.g. "BasicLSTM" or "LSTMBlock"
 unit_opts (dict[str]|None) – passed to the cell.__init__
 initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
 initial_output (None) – the initial output is defined implicitly via initial state, thus don't set this

classmethod
get_out_data_from_opts
(n_out, name, sources=(), **kwargs)[source]¶ Parameters:  n_out (int) –
 name (str) – layer name
 sources (list[LayerBase]) –
Return type:
Parameters:  n_out (int) –
 unit (str) –
 unit_opts (dict[str]|None) –
Returns: size or tuple of sizes
Return type: int|tuple[int]

classmethod
get_output_from_state
(state, unit)[source]¶ Parameters:  state (tuple[tf.Tensor]|tf.Tensor) –
 unit (str) –
Return type: tf.Tensor
Returns: state as defined by the cell Return type: tuple[tf.Tensor]|tf.Tensor

classmethod
get_state_by_key
(state, key, shape=None)[source]¶ Parameters:  state (tf.Tensor|tuple[tf.Tensor]|namedtuple) –
 key (int|str|None) –
 shape (tuple[int|None]) – Shape of the state.
Return type: tf.Tensor
Parameters: key (int|str|None) – Return type: tf.Tensor

classmethod
get_rec_initial_state
(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, rec_layer=None, **kwargs)[source]¶ Very similar to
get_rec_initial_output()
. Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see
transform_config_dict()
for initial_state.
Note: This could maybe share code with
get_rec_initial_output()
, although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.
Parameters:  batch_dim (tf.Tensor) – including beam size in beam search
 name (str) – layer name
 n_out (int) – out dim
 unit (str) – cell name
 unit_opts (dict[str]|None) –
 initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
 rec_layer (RecLayer|LayerBase|None) – for the scope
Return type: tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod
get_rec_initial_state_inner
(initial_shape, name, state_key='state', key=None, initial_state=None, shape_invariant=None, rec_layer=None)[source]¶ Generate initial hidden state. Primarily used as an inner function for RnnCellLayer.get_rec_initial_state.
Parameters:  initial_shape (tuple) – shape of the initial state.
 name (str) – layer name.
 state_key (str) – key to be used to get the state from final_rec_vars.
 key (str|None) – key/attribute of the state if state is a dictionary/namedtuple (like "c" and "h" for LSTM states).
 initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
 shape_invariant (tuple) – If provided, directly used. Otherwise, guessed from initial_shape (see code below).
 rec_layer (RecLayer|LayerBase|None) – For the scope.
Return type: tf.Tensor

classmethod
get_rec_initial_extra_outputs
(**kwargs)[source]¶ Return type: dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) – will modify inplace
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer

static
transform_initial_state
(initial_state, network, get_layer)[source]¶ Parameters:  initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
 network (returnn.tf.network.TFNetwork) –
 get_layer ((str)->LayerBase) – function to get or construct another layer
Self-Attention Layer¶

class
returnn.tf.layers.rec.
SelfAttentionLayer
(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)[source]¶ Applies self-attention on the input. I.e., with input x, it will basically calculate
att(Q x, K x, V x),
where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time-dimension. If there is no time-dimension, we expect to be inside a
RecLayer
; also, this is only valid with attention_to_past_only=True. See also dot_product_attention here:
 https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
Parameters:  num_heads (int) –
 total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
 key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside the rec-layer or (T,T,key_dim) outside.
 forward_weights_init (str) – see
TFUtil.get_initializer()
 attention_dropout (float) –
 attention_left_only (bool) – will mask out the future. see Attention is all you need.
 initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
 restrict_state_to_last_seq (bool) – see code comment below
 state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.
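A config sketch for masked multi-head self-attention (the source name "embed" and the dimensions are assumptions):

```python
# Hypothetical network fragment: left-only (masked) multi-head
# self-attention, e.g. inside a Transformer decoder block.
network = {
    "self_att": {
        "class": "self_attention",  # SelfAttentionLayer
        "from": "embed",
        "num_heads": 8,
        "total_key_dim": 512,       # key_dim per head = 512 // 8 = 64
        "n_out": 512,
        "attention_left_only": True,
        "attention_dropout": 0.1,
    },
}
```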

classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –

classmethod
get_out_data_from_opts
(n_out, name, sources, **kwargs)[source]¶ Parameters:  n_out (int) –
 name (str) –
 sources (list[LayerBase]) –
Return type:

classmethod
get_rec_initial_extra_outputs
(batch_dim, rec_layer, num_heads, total_key_dim, n_out, name, initial_state=None, sources=(), **kwargs)[source]¶ Parameters:  batch_dim (tf.Tensor) –
 rec_layer (RecLayer|LayerBase) –
 num_heads (int) –
 total_key_dim (int) –
 n_out (int) –
 name (str) –
 initial_state (str|float|int|None) –
 sources (list[LayerBase]) –
Return type: dict[str, tf.Tensor]
Attention Layers¶
Concatenative Attention Layer¶

class
returnn.tf.layers.rec.
ConcatAttentionLayer
(**kwargs)[source]¶ Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared this to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.
Dot-Product Attention Layer¶

class
returnn.tf.layers.rec.
DotAttentionLayer
(energy_factor=None, **kwargs)[source]¶ Classic global attention: Dot-product as similarity measure between base_ctx and source.
Parameters:  base (LayerBase) – encoder output to attend on. defines output-dim
 base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data
 energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
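The computation this layer performs for a single query can be sketched in plain Python (a simplified single-query, single-batch illustration, not RETURNN's actual implementation):

```python
import math

def dot_attention(query, keys, values, energy_factor=None):
    """Dot-product similarity between the query and each key, softmax
    over the key positions, then a weighted sum of the values."""
    factor = energy_factor if energy_factor is not None else 1.0 / math.sqrt(len(query))
    energies = [factor * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(energies)                              # for numerical stability
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]            # attention weights, sum to 1
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With identical keys the weights become uniform, so the output is simply the mean of the values.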
Gauss Window Attention Layer¶

class
returnn.tf.layers.rec.
GaussWindowAttentionLayer
(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]¶ Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).
Parameters:  window_size (int) – the window size where the Gaussian window will be applied on the base
 std (float) – standard deviation for Gauss
 inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
 inner_size_step (float) – see inner_size above
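The Gauss-window weighting itself can be sketched in plain Python (a simplified illustration of the weighting, not RETURNN's actual implementation):

```python
import math

def gauss_window_weights(location, positions, std=1.0):
    """Gaussian bump centered on a (possibly fractional) location,
    renormalized so the weights over the window positions sum to 1."""
    raw = [math.exp(-((t - location) ** 2) / (2.0 * std ** 2)) for t in positions]
    total = sum(raw)
    return [r / total for r in raw]
```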
Generic Attention Layer¶

class
returnn.tf.layers.rec.
GenericAttentionLayer
(weights, auto_squeeze=True, **kwargs)[source]¶ The weighting for the base is specified explicitly here. This can e.g. be used together with
SoftmaxOverSpatialLayer
. Note that we do not do any masking here. E.g.
SoftmaxOverSpatialLayer
does that.
Note that
DotLayer
is similar, just using a different terminology. Reduce axis: weights: time-axis; base: time-axis. Note that if the last layer was
SoftmaxOverSpatialLayer
, we should use the same time-axis. Also we should do a check whether these time axes really match.
Common axes (should match): batch-axis, all from base excluding base feature axis and excluding time axis. Keep axes: base: feature axis; weights: all remaining, e.g. extra time.
Parameters:
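A typical pairing with SoftmaxOverSpatialLayer, as a config sketch (the layer names "energy_in" and "encoder" are assumptions):

```python
# Hypothetical fragment of a rec-layer subnetwork: compute attention
# weights with SoftmaxOverSpatialLayer (which also does the masking),
# then apply them to the encoder via GenericAttentionLayer.
rec_unit = {
    "energy": {"class": "linear", "from": "energy_in", "n_out": 1, "activation": None},
    "att_weights": {"class": "softmax_over_spatial", "from": "energy"},
    "att": {
        "class": "generic_attention",  # GenericAttentionLayer
        "weights": "att_weights",
        "base": "base:encoder",        # encoder layer outside the rec loop
    },
}
```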
classmethod
transform_config_dict
(d, network, get_layer)[source]¶ Parameters:  d (dict[str]) –
 network (returnn.tf.network.TFNetwork) –
 get_layer –

classmethod
Self-Attention Layer¶

class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)[source]¶
Applies self-attention on the input. I.e., with input x, it will basically calculate att(Q x, K x, V x), where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time dimension. If there is no time dimension, we expect to be inside a RecLayer; also, this is only valid with attention_to_past_only=True. See also dot_product_attention here:
 https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
Parameters:
 - num_heads (int) –
 - total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
 - key_shift (LayerBase|None) – additive term to the key. Can be used for relative positional encoding. Should be of shape (num_queries, num_keys, key_dim), currently without batch dimension. I.e. that should be shape (1, t, key_dim) inside the rec layer or (T, T, key_dim) outside.
 - forward_weights_init (str) – see TFUtil.get_initializer()
 - attention_dropout (float) –
 - attention_left_only (bool) – will mask out the future. See "Attention Is All You Need".
 - initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
 - restrict_state_to_last_seq (bool) – see code comment below
 - state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch entry, used for decoding in RASR.
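As a usage sketch, a self-attention layer in a network config dict might look as follows (a hypothetical fragment; the layer name, dims, and sibling layers are made up for illustration):

```python
# Hypothetical RETURNN network config fragment: a masked multi-head
# self-attention layer (decoder-style, masking out the future).
network = {
    "self_att": {
        "class": "self_attention",
        "from": "data",
        "num_heads": 8,
        "total_key_dim": 512,         # key_dim per head == 512 // 8 == 64
        "n_out": 512,                 # total value dim
        "attention_left_only": True,  # mask out the future
        "attention_dropout": 0.1,
    },
    "output": {"class": "softmax", "from": "self_att", "loss": "ce"},
}

# Per-head key dim, as described for total_key_dim above:
key_dim = network["self_att"]["total_key_dim"] // network["self_att"]["num_heads"]
```

Verify option names against the parameter list above for your RETURNN version.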

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) –
 - network (returnn.tf.network.TFNetwork) –
 - get_layer –

classmethod get_out_data_from_opts(n_out, name, sources, **kwargs)[source]¶
Parameters:
 - n_out (int) –
 - name (str) –
 - sources (list[LayerBase]) –
Return type:

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, num_heads, total_key_dim, n_out, name, initial_state=None, sources=(), **kwargs)[source]¶
Parameters:
 - batch_dim (tf.Tensor) –
 - rec_layer (RecLayer|LayerBase) –
 - num_heads (int) –
 - total_key_dim (int) –
 - n_out (int) –
 - name (str) –
 - initial_state (str|float|int|None) –
 - sources (list[LayerBase]) –
Return type: dict[str, tf.Tensor]
Norm and Regularization Layers¶
Generic Normalization Layer¶

class returnn.tf.layers.basic.NormLayer(axes, param_shape='F', scale=True, bias=True, epsilon=1e-06, **kwargs)[source]¶
Normalize over specified axes, e.g. time and/or feature axis.
In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more generally all except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer normalization is not clear on which axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)
The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).
For some discussion on the definition of layer-norm vs group-norm, also see here and here.
Parameters:
 - axes (str|list[str]) – axes over which the mean and variance are computed, e.g. "F" or "TF"
 - param_shape (str|list[str]|tuple[str]|int|list[int]|tuple[int]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature dim. This is also the default, i.e. a param shape of [F], independent of the axes to normalize over.
 - scale (bool) – add trainable scale parameters
 - bias (bool) – add trainable bias parameters
 - epsilon (float) – epsilon for numerical stability
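For example, layer normalization and a group-norm-like variant differ only in the axes option (a hypothetical config fragment; layer names and sources are illustrative):

```python
# Hypothetical RETURNN config fragments: NormLayer used two ways.
layer_norm = {"class": "norm", "from": "encoder", "axes": "F"}       # like LayerNormLayer
group_norm_g1 = {"class": "norm", "from": "encoder", "axes": "TF"}   # group norm with G=1 on 3D input
```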
Batch-Normalization Layer¶

class returnn.tf.layers.basic.BatchNormLayer(**kwargs)[source]¶
Implements batch-normalization (http://arxiv.org/abs/1502.03167) as a separate layer.
Also see NormLayer.
All kwargs which are present in our base class are passed to our base class. All remaining kwargs are used for self.batch_norm().
Layer-Normalization Layer¶

class returnn.tf.layers.basic.LayerNormLayer(epsilon=1e-06, **kwargs)[source]¶
Applies layer-normalization.
Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and also how it is commonly used in many models, including Transformer.
However, there are cases where it would be common to normalize over all axes except the batch dim, or all axes except batch and time. For a more generic variant, see NormLayer.
Dropout Layer¶

class returnn.tf.layers.basic.DropoutLayer(extra_deps=(), **kwargs)[source]¶
Just the same as CopyLayer, because that one already supports dropout.
Parameters: extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency.
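A minimal usage sketch (hypothetical config fragment; the source layer name is illustrative):

```python
# Hypothetical RETURNN config fragment: dropout via DropoutLayer.
# This is equivalent to a CopyLayer with the dropout option set;
# dropout is only applied when the train flag is set.
drop = {"class": "dropout", "from": "encoder", "dropout": 0.3}
```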
Custom Layers¶
Eval Layer¶

class returnn.tf.layers.basic.EvalLayer(eval, **kwargs)[source]¶
Evaluates some string. The CombineLayer provides this functionality, thus this is just a special case of it. Also see ActivationLayer or CompareLayer.
The output type is defined as a broadcasted extension of all sources. You can overwrite it by (partially) specifying out_type. out_type can also be a generic Python function, returning a Data instance.
Parameters: eval (str) – will eval this string. See _op_kind_eval().
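A usage sketch (hypothetical config fragment; source layer names are illustrative). Inside the eval string, `source(i)` refers to the i-th layer listed in "from":

```python
# Hypothetical RETURNN config fragment: EvalLayer summing two sources.
combined = {
    "class": "eval",
    "from": ["enc_fwd", "enc_bwd"],
    "eval": "source(0) + source(1)",
}
```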
Subnetwork Layer¶

class returnn.tf.layers.basic.SubnetworkLayer(subnetwork, concat_sources=True, load_on_init=None, dropout=0, dropout_noise_shape=None, _parent_layer_cache=None, **kwargs)[source]¶
You can define a whole subnetwork as a single layer by this class.
The subnetwork will be specified by a dict[str,dict[str]], just like a normal network is specified in the config.
The "output" layer of the subnetwork will be the output of this subnetwork layer. With concat_sources=True (default), the input to this layer will be represented as "data:data" or simply "data" in the subnetwork; otherwise, with concat_sources=False, the input to this layer will be represented as "data:input_layer_name" for each input, in the subnetwork.
Parameters:
 - subnetwork (dict[str,dict]) – subnetwork as dict (JSON content). Must have an "output" layer.
 - concat_sources (bool) – if we concatenate all sources into one, like it is standard for most other layers
 - load_on_init (str|dict[str]|None) – if provided, for parameter initialization, we will load the given model file. See CustomCheckpointLoader.
 - dropout (float) – will be applied if train_flag is set
 - dropout_noise_shape (tuple|list|dict|None) –
 - _parent_layer_cache (dict[str,LayerBase]|None) –
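For example, a two-layer feed-forward block wrapped as one layer (hypothetical config fragment; layer names and dims are illustrative):

```python
# Hypothetical RETURNN config fragment: a feed-forward block as a subnetwork.
# The subnetwork's "output" layer is the output of the whole SubnetworkLayer;
# "data" inside refers to the concatenated input (concat_sources=True).
ff_block = {
    "class": "subnetwork",
    "from": "encoder",
    "subnetwork": {
        "ff1": {"class": "linear", "activation": "relu", "from": "data", "n_out": 2048},
        "output": {"class": "linear", "activation": None, "from": "ff1", "n_out": 512},
    },
}
```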

classmethod get_out_data_from_opts(subnetwork, concat_sources=True, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, **kwargs)[source]¶
Parameters:
 - subnetwork (dict[str,dict[str]]) –
 - concat_sources (bool) –
 - n_out (int|None|NotSpecified) –
 - out_type (dict[str]|None) –
Return type:

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) –
 - network (returnn.tf.network.TFNetwork) –
 - get_layer –

classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]¶
Parameters:
 - name (str) – layer name
 - network (returnn.tf.network.TFNetwork) –
 - loss (Loss|None) – argument just as for __init__
 - output (Data) – the output (template) for the layer
 - layer (LayerBase|None) –
 - reduce_func (((tf.Tensor)->tf.Tensor)|None) –
 - kwargs – other layer kwargs
Return type:

get_sub_layer(layer_name)[source]¶
Parameters: layer_name (str) – name of the sub_layer (right part of "/" separated path)
Returns: the sub_layer addressed in layer_name or None if no sub_layer exists
Return type: LayerBase|None
Parameters: key (int|str|None) – also the special key "*"
Return type: tf.Tensor|None
Utility Layers¶
Framewise Statistics Layer¶
HDFDumpLayer¶

class returnn.tf.layers.basic.HDFDumpLayer(filename, extra=None, dump_whole_batches=False, labels=None, extend_existing_file=False, dump_per_run=False, **kwargs)[source]¶
Dumps into an HDF file, compatible to HDFDataset.
The HDF will be written to disk under the specified filename, if there was no error, by default at graph reset, via TFNetwork.register_graph_reset_callback(). Or after the dataset iteration run loop, with dump_per_run, via TFNetwork.register_run_finished_callback().
Common usage would be to add this to your network with "is_output_layer": True, such that you don't need to make other layers depend on it.
It currently uses SimpleHDFWriter internally.
Parameters:
 - filename (str|(()->str)) –
 - extra (None|dict[str,LayerBase]) –
 - dump_whole_batches (bool) – dumps the whole batch as a single sequence into the HDF
 - labels (list[str]|None) –
 - extend_existing_file (bool) – True also means we expect that it exists
 - dump_per_run (bool) – write via TFNetwork.register_run_finished_callback()

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶
Parameters:
 - name (str) –
 - sources (list[LayerBase]) –
Return type:

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) – will modify inplace
 - network (returnn.tf.network.TFNetwork) –
 - get_layer (((str)->LayerBase)) – function to get or construct another layer
Image Summary Layer¶

class returnn.tf.layers.basic.ImageSummaryLayer(max_outputs=3, **kwargs)[source]¶
Creates image summaries which can be viewed in TensorBoard. This layer expects the source to be in (T-decoder, T-encoder, B, 1).
Parameters: max_outputs – number of images to generate per step

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) – will modify inplace, the loss_opts
 - network (returnn.tf.network.TFNetwork) –
 - get_layer (((str)->LayerBase)) – function to get or construct another layer

Print Layer¶
Scaled Gradient Layer¶

class returnn.tf.layers.basic.ScaledGradientLayer(scale, **kwargs)[source]¶
Just tf.identity in the forward pass. Scales the gradient by some factor in backprop. Can be used as a gradient reversal layer (with a negative factor). Uses TFUtil.scaled_gradient(), or tf.stop_gradient().
Parameters: scale (float) – if 0.0, will use tf.stop_gradient
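A common use is gradient reversal for adversarial/domain-adaptation setups (hypothetical config fragment; the layer class string and source name are assumptions to verify against your RETURNN version):

```python
# Hypothetical RETURNN config fragment: gradient reversal via a negative scale.
# Forward pass is identity; the gradient is multiplied by -1.0 in backprop.
grad_reverse = {"class": "scaled_grad", "from": "features", "scale": -1.0}
```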
Synthetic Gradient Layer¶

class returnn.tf.layers.basic.SyntheticGradientLayer(gradient, meta_loss_scale=1.0, **kwargs)[source]¶
This is a generalized way to be able to replace the true gradient with any kind of predicted gradient. This enables implementing the idea from:
Decoupled Neural Interfaces using Synthetic Gradients, https://arxiv.org/abs/1608.05343
Parameters:
 - gradient (LayerBase) –
 - meta_loss_scale (float) –

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) –
 - network (returnn.tf.network.TFNetwork) –
 - get_layer –
Loss Functions¶
As-Is Loss¶
Binary Cross-Entropy Loss¶

class returnn.tf.layers.basic.BinaryCrossEntropyLoss(pos_weight=None, **kwargs)[source]¶
Binary cross entropy. We expect the output as logits, not in probability space! Per frame: mean(target * log(sigmoid(output)) + (1 - target) * log(1 - sigmoid(output)))
Parameters: pos_weight (float|None) – weight of positive labels, see tf.nn.weighted_cross_entropy_with_logits.
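To make the per-frame formula concrete, here is a small NumPy sketch (the loss to be minimized is the negative of this mean; the naive sigmoid here is for clarity, not the numerically stable form used in practice):

```python
import numpy as np

def binary_cross_entropy(logits, targets):
    """Per-frame binary cross entropy, with the output given as logits."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid(output)
    # Negative of the mean from the formula above, so the loss is minimized.
    return -np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))

logits = np.array([2.0, -1.0, 0.0])
targets = np.array([1.0, 0.0, 1.0])
loss = binary_cross_entropy(logits, targets)
```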
Bleu Loss¶

class returnn.tf.layers.basic.BleuLoss(**kwargs)[source]¶
Note that this loss is not differentiable, thus it's only for keeping statistics. Also, BLEU is a score, i.e. the higher, the better. Thus, to interpret it as a loss or error, we take the negative value.
Cross-Entropy Loss¶

class returnn.tf.layers.basic.CrossEntropyLoss(focal_loss_factor=0.0, label_smoothing=0.0, label_smoothing_gaussian=False, debug_dump=False, safe_log_opts=None, use_fused=True, fake_upper_bound=None, **kwargs)[source]¶
Cross-entropy loss. Basically sum(target * log(output)).
Parameters:
 - focal_loss_factor (float) – see https://arxiv.org/abs/1708.02002. 0 means disabled.
 - label_smoothing (float) – 0.1 is a common default. See TFUtil.smoothing_cross_entropy().
 - label_smoothing_gaussian (bool) – see TFUtil.smoothing_cross_entropy().
 - debug_dump (bool) –
 - safe_log_opts (dict[str]) – passed to safe_log()
 - use_fused (bool) – if possible, use fused ops
 - fake_upper_bound (float|None) – uses TFUtil.minimum_with_identity_grad(). I.e. you will see a finite loss, but we use the original gradient (which should be safe).
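A sketch of the basic formula in NumPy, ignoring the focal-loss and label-smoothing options (as a loss one takes the negative, so that the correct class's probability is pushed up):

```python
import numpy as np

def cross_entropy(output_probs, target_onehot):
    """Basic cross entropy per frame: -sum(target * log(output))."""
    return -np.sum(target_onehot * np.log(output_probs))

probs = np.array([0.7, 0.2, 0.1])    # model output in probability space
target = np.array([1.0, 0.0, 0.0])   # one-hot target: class 0
loss = cross_entropy(probs, target)  # == -log(0.7)
```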
CTC Loss¶

class returnn.tf.layers.basic.CtcLoss(target_collapse_repeated=False, auto_clip_target_len=False, output_in_log_space=False, beam_width=100, ctc_opts=None, focal_loss_factor=0.0, use_native=False, use_viterbi=False, **kwargs)[source]¶
Connectionist Temporal Classification (CTC) loss. Basically a wrapper around tf.nn.ctc_loss.
Parameters:
 - target_collapse_repeated (bool) – like the preprocess_collapse_repeated option for CTC. Used for sparse_labels().
 - auto_clip_target_len (bool) – see self._get_target_sparse_labels().
 - output_in_log_space (bool) – False -> output expected in prob space. See self.get_output_logits.
 - beam_width (int) – used in eval
 - ctc_opts (dict[str]|None) – other kwargs used for tf.nn.ctc_loss
 - focal_loss_factor (float) – see https://arxiv.org/abs/1708.02002. 0 means disabled. Generalized for CTC.
 - use_native (bool) – use our native implementation (TFNativeOp.ctc_loss())
 - use_viterbi (bool) – instead of full-sum, use only the best path (via ctc_loss_viterbi())
Deep Clustering Loss¶

class returnn.tf.layers.basic.DeepClusteringLoss(embedding_dimension, nr_of_sources, **kwargs)[source]¶
Cost function used for deep clustering as described in [Hershey & Chen+, 2016]: "Deep clustering: discriminative embeddings for segmentation and separation"
Parameters:
 - embedding_dimension (int) –
 - nr_of_sources (int) –
Edit Distance Loss¶

class returnn.tf.layers.basic.EditDistanceLoss(debug_print=False, label_map=None, ctc_decode=False, output_in_log_space=False, **kwargs)[source]¶
Note that this loss is not differentiable, thus it's only for keeping statistics.
Parameters:
 - debug_print (bool) – will tf.Print the sequence
 - label_map (dict[int,int]|None) – before calculating the edit distance, will apply this map
 - ctc_decode (bool) – True -> expects dense output and does CTC decode, False -> expects sparse labels in output
 - output_in_log_space (bool) – False -> dense output expected in prob space. See self.get_output_logits.
Expected Loss¶

class returnn.tf.layers.basic.ExpectedLoss(loss, loss_kind, norm_scores=True, norm_scores_stop_gradient=True, divide_beam_size=True, subtract_average_loss=True, loss_correction_grad_only=False, **kwargs)[source]¶
This loss uses another loss error or value and, given the search beam scores, calculates the expected loss. Sometimes also called minimum Bayes risk.
Parameters:
 - loss (Loss) –
 - loss_kind (str) – "error" or "value". Whether to use loss.get_error() or loss.get_value().
 - norm_scores (bool) –
 - norm_scores_stop_gradient (bool) –
 - divide_beam_size (bool) –
 - subtract_average_loss (bool) –
 - loss_correction_grad_only (bool) –

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) –
 - network (returnn.tf.network.TFNetwork) –
 - get_layer –
Extern Sprint Loss¶
Fast Baum-Welch Loss¶
Generic Cross-Entropy Loss¶
Mean-L1 Loss¶

Mean-Squared-Error Loss¶

class returnn.tf.layers.basic.MeanSquaredError(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, scale=1.0)[source]¶
The generic mean-squared-error loss function.
Parameters:
 - base_network (returnn.tf.network.TFNetwork) –
 - use_flatten_frames (bool) – will use TFUtil.flatten_with_seq_len_mask()
 - use_normalized_loss (bool) – the loss used in optimization will be normalized
 - custom_norm_factor (float|function|None) –
 - scale (float) – additional scale factor for the loss

class_name = 'mse'[source]¶

get_value()[source]¶
Return type: tf.Tensor
L1 Loss¶

class returnn.tf.layers.basic.L1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, scale=1.0)[source]¶
L1-distance loss. sum(target - output).
Parameters:
 - base_network (returnn.tf.network.TFNetwork) –
 - use_flatten_frames (bool) – will use TFUtil.flatten_with_seq_len_mask()
 - use_normalized_loss (bool) – the loss used in optimization will be normalized
 - custom_norm_factor (float|function|None) –
 - scale (float) – additional scale factor for the loss
Sampling-Based Loss¶

class returnn.tf.layers.basic.SamplingBasedLoss(num_sampled=128, num_splits=1, sampler='log_uniform', nce_loss=False, use_full_softmax=False, remove_accidental_hits=None, sampler_args=None, nce_log_norm_term=0.0, **kwargs)[source]¶
Implements two sampling-based losses: sampled softmax (default) and noise contrastive estimation. https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss. https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss.
Must be used in an output linear layer with a weight matrix of shape (num_classes, dim). When using the "log_uniform" sampler (default), optimal performance is typically achieved with the vocabulary list sorted in decreasing order of frequency (https://www.tensorflow.org/api_docs/python/tf/random/log_uniform_candidate_sampler).
Parameters:
 - num_sampled (int) – Number of classes to be sampled. For sampled softmax, this is the number of classes to be used to estimate the sampled softmax. For noise contrastive estimation, this is the number of noise samples.
 - num_splits (int) – Number of different samples (each with "num_sampled" classes) to be used per batch.
 - sampler (str) – Specify the sampling distribution ("uniform", "log_uniform", "learned_unigram" or "fixed_unigram").
 - nce_loss (bool) – If True, use noise contrastive estimation loss. Else (default), use the sampled softmax.
 - use_full_softmax (bool) – If True, compute the full softmax instead of sampling (can be used for evaluation).
 - remove_accidental_hits (bool|None) – If True, remove sampled classes that equal one of the target classes. If not specified (None), the value is determined based on the chosen objective. For sampled softmax this should be set to True; for NCE the default is False. Set this to True in case of NCE training and the objective is equal to sampled logistic loss.
 - sampler_args (dict[str]) – additional arguments for the candidate sampler. This is most relevant to the fixed_unigram sampler. See https://www.tensorflow.org/api_docs/python/tf/random/fixed_unigram_candidate_sampler for details.
 - nce_log_norm_term (float) – The logarithm of the constant normalization term for NCE.
Triplet Loss¶

class returnn.tf.layers.basic.TripletLoss(margin, multi_view_training=False, **kwargs)[source]¶
Triplet loss: loss = max(margin + d(x_a, x_s) - d(x_a, x_d), 0.0)
Triplet loss is used for metric learning in a siamese/triplet network. It should be used as part of a CopyLayer with 3 inputs corresponding to x_a, x_s and x_d in the loss. Here we assume that x_a are anchor samples, and x_s are samples where at each position i in a minibatch x_ai and x_si belong to the same class, while pairs x_ai and x_di belong to different classes.
In this implementation the number of training examples is increased by extracting all possible same/different pairs within a minibatch.
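The formula can be sketched directly in NumPy (squared Euclidean distance is used here for illustration; the actual distance function is an assumption):

```python
import numpy as np

def triplet_loss(x_a, x_s, x_d, margin):
    """loss = max(margin + d(x_a, x_s) - d(x_a, x_d), 0.0),
    with d chosen as squared Euclidean distance for this sketch."""
    d_as = np.sum((x_a - x_s) ** 2)  # anchor vs. same-class sample
    d_ad = np.sum((x_a - x_d) ** 2)  # anchor vs. different-class sample
    return max(margin + d_as - d_ad, 0.0)

x_a = np.array([0.0, 0.0])
x_s = np.array([0.1, 0.0])  # same class: close to anchor -> loss hits 0
x_d = np.array([1.0, 1.0])  # different class: far from anchor
loss = triplet_loss(x_a, x_s, x_d, margin=0.2)
```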
Via Layer Loss¶

class returnn.tf.layers.basic.ViaLayerLoss(error_signal_layer=None, align_layer=None, loss_wrt_to_act_in=False, **kwargs)[source]¶
The loss error signal and loss value are defined as the output of another layer. That way, you can define any custom loss. This could e.g. be used together with the fast_bw layer.
Parameters:
 - error_signal_layer (LayerBase) –
 - align_layer (LayerBase) –
 - loss_wrt_to_act_in (bool|str) – if True, we expect that the given output_with_activation is set, and the given error signal is w.r.t. the input of the specific activation function. A common example is the input to the softmax function, where the gradient is much more stable to define, e.g. y - z instead of y/z for cross entropy. If you specify a str, e.g. "softmax" or "log_softmax", there is an additional check that the used activation function is really that one.

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) – will modify inplace, the loss_opts
 - network (returnn.tf.network.TFNetwork) –
 - get_layer (((str)->LayerBase)) – function to get or construct another layer
Softmax Layers¶
Batched Softmax Layer¶
Softmax Layer¶
SoftmaxOverSpatial Layer¶

class returnn.tf.layers.basic.SoftmaxOverSpatialLayer(axis=None, energy_factor=None, start=None, window_start=None, window_size=None, use_time_mask=None, **kwargs)[source]¶
This applies a softmax over a spatial axis (currently only the time axis is supported). E.g. when the input is of shape (B,T,dim), the output will be (B,T,dim). It automatically masks the frames outside the seq defined by the seq len. In contrast to SoftmaxLayer, this will not do a linear transformation. See SeqLenMaskLayer if you just want to apply a masking.
Parameters:
 - axis (str|None) – which axis to do the softmax over
 - energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In "Attention Is All You Need", this is set to 1/sqrt(base_ctx.dim).
 - start (LayerBase|None) – Tensor of shape (B,) indicating the start frame
 - window_start (LayerBase|None) – Tensor of shape (B,) indicating the window start
 - window_size (LayerBase|int|None) –
 - use_time_mask (bool) – if True, assumes dyn seq len, and uses it for masking. By default, if dyn seq len exists, it uses it.
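The masking behavior can be sketched in NumPy: frames beyond each sequence's length get zero probability, and the remaining frames renormalize to 1 (a simplified sketch, not the actual implementation):

```python
import numpy as np

def softmax_over_time(energies, seq_lens):
    """energies: (B, T). Masks frames t >= seq_lens[b] before the softmax."""
    num_frames = energies.shape[1]
    mask = np.arange(num_frames)[None, :] < np.array(seq_lens)[:, None]  # (B, T)
    e = np.where(mask, energies, -np.inf)  # masked frames get -inf energy
    e = e - e.max(axis=1, keepdims=True)   # shift for numerical stability
    w = np.exp(e) * mask                   # exp(-inf) -> 0 on masked frames
    return w / w.sum(axis=1, keepdims=True)

# Uniform energies: probability mass spreads only over the valid frames.
att = softmax_over_time(np.zeros((2, 4)), seq_lens=[4, 2])
```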

classmethod get_out_data_from_opts(name, sources, axis=None, start=None, window_start=None, window_size=None, **kwargs)[source]¶
Parameters:
 - name (str) –
 - sources (list[LayerBase]) –
 - axis (str|None) –
 - start (LayerBase|None) –
 - window_start (LayerBase|None) –
 - window_size (LayerBase|int|None) –
Return type:

classmethod transform_config_dict(d, network, get_layer)[source]¶
Parameters:
 - d (dict[str]) –
 - network (returnn.tf.network.TFNetwork) –
 - get_layer –
Recurrent Units¶
These are the units that can be used in a TFNetworkRecLayer.RecLayer type of layer.
Common units are:
 - BasicLSTM (the cell), via official TF, pure TF implementation
 - LSTMBlock (the cell), via tf.contrib.rnn
 - LSTMBlockFused, via tf.contrib.rnn; should be much faster than BasicLSTM
 - CudnnLSTM, via tf.contrib.cudnn_rnn; this is still experimental
 - NativeLSTM, our own native LSTM; should be faster than LSTMBlockFused
 - NativeLstm2, improved own native LSTM; should be the fastest and most powerful
Note that the native implementations cannot be in a recurrent subnetwork, as they process the whole sequence at once. A performance comparison of the different LSTM layers is available here.
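The unit is selected via the "unit" option of a rec layer. A sketch (hypothetical config fragment; layer names and dims are illustrative, and the lowercase unit string is the usual config spelling, to be verified against your RETURNN version):

```python
# Hypothetical RETURNN config fragment: a bidirectional LSTM pair using the
# NativeLstm2 unit from the list above.
lstm_fwd = {"class": "rec", "unit": "nativelstm2", "direction": 1,
            "from": "data", "n_out": 512}
lstm_bwd = {"class": "rec", "unit": "nativelstm2", "direction": -1,
            "from": "data", "n_out": 512}
```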
BasicLSTMCell¶

class tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None, name=None, dtype=None, **kwargs)[source]¶
DEPRECATED: Please use tf.compat.v1.nn.rnn_cell.LSTMCell instead.
Basic LSTM recurrent network cell.
The implementation is based on: http://arxiv.org/abs/1409.2329.
We add forget_bias (default: 1) to the biases of the forget gate in order to reduce the scale of forgetting in the beginning of the training.
It does not allow cell clipping, a projection layer, and does not use peephole connections: it is the basic baseline.
For advanced models, please use the full tf.compat.v1.nn.rnn_cell.LSTMCell that follows.
Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU, or tf.contrib.rnn.LSTMBlockCell and tf.contrib.rnn.LSTMBlockFusedCell for better performance on CPU.
Initialize the basic LSTM cell. (deprecated)
Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This class is equivalent to tf.keras.layers.LSTMCell and will be replaced by that in TensorFlow 2.0.
Args:
 - num_units: int, The number of units in the LSTM cell.
 - forget_bias: float, The bias added to forget gates (see above). Must be set to 0.0 manually when restoring from CudnnLSTM-trained checkpoints.
 - state_is_tuple: If True, accepted and returned states are 2-tuples of the c_state and m_state. If False, they are concatenated along the column axis. The latter behavior will soon be deprecated.
 - activation: Activation function of the inner states. Default: tanh. It could also be a string that is within Keras activation function names.
 - reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 - name: String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases.
 - dtype: Default dtype of the layer (default of None means use the type of the first input). Required when build is called before call.
 - **kwargs: Dict, keyword named properties for common layer attributes, like trainable etc. when constructing the cell from configs of get_config(). When restoring from CudnnLSTM-trained checkpoints, must use CudnnCompatibleLSTMCell instead.

state_size[source]¶
Size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call(inputs, state)[source]¶
Long short-term memory cell (LSTM).
Args:
 - inputs: 2-D tensor with shape [batch_size, input_size].
 - state: An LSTMStateTuple of state tensors, each shaped [batch_size, num_units], if state_is_tuple has been set to True. Otherwise, a Tensor shaped [batch_size, 2 * num_units].
Returns:
 A pair containing the new hidden state, and the new state (either a LSTMStateTuple or a concatenated state, depending on state_is_tuple).

get_config()[source]¶
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Returns:
 Python dictionary.
BasicRNNCell¶

class tensorflow.python.ops.rnn_cell_impl.BasicRNNCell(num_units, activation=None, reuse=None, name=None, dtype=None, **kwargs)[source]¶
The most basic RNN cell.
Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnRNNTanh for better performance on GPU.
Args:
 - num_units: int, The number of units in the RNN cell.
 - activation: Nonlinearity to use. Default: tanh. It could also be a string that is within Keras activation function names.
 - reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 - name: String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases.
 - dtype: Default dtype of the layer (default of None means use the type of the first input). Required when build is called before call.
 - **kwargs: Dict, keyword named properties for common layer attributes, like trainable etc. when constructing the cell from configs of get_config().
DEPRECATED FUNCTION
Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This class is equivalent to tf.keras.layers.SimpleRNNCell and will be replaced by that in TensorFlow 2.0.

state_size[source]¶
Size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_config()[source]¶
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Returns:
 Python dictionary.
BidirectionalGridLSTMCell¶

class tensorflow.contrib.rnn.python.ops.rnn_cell.BidirectionalGridLSTMCell(num_units, use_peepholes=False, share_time_frequency_weights=False, cell_clip=None, initializer=None, num_unit_shards=1, forget_bias=1.0, feature_size=None, frequency_skip=None, num_frequency_blocks=None, start_freqindex_list=None, end_freqindex_list=None, couple_input_forget_gates=False, backward_slice_offset=0, reuse=None)[source]¶
Bidirectional GridLstm cell.
The bidirectional connection is only used in the frequency direction, which hence doesn't affect the time direction's real-time processing that is required for online recognition systems. The current implementation uses different weights for the two directions.
Initialize the parameters for an LSTM cell.
Args:
 - num_units: int, The number of units in the LSTM cell.
 - use_peepholes: (optional) bool, default False. Set True to enable diagonal/peephole connections.
 - share_time_frequency_weights: (optional) bool, default False. Set True to enable shared cell weights between time and frequency LSTMs.
 - cell_clip: (optional) A float value, default None. If provided, the cell state is clipped by this value prior to the cell output activation.
 - initializer: (optional) The initializer to use for the weight and projection matrices, default None.
 - num_unit_shards: (optional) int, default 1. How to split the weight matrix. If > 1, the weight matrix is stored across num_unit_shards.
 - forget_bias: (optional) float, default 1.0. The initial bias of the forget gates, used to reduce the scale of forgetting at the beginning of the training.
 - feature_size: (optional) int, default None. The size of the input feature the LSTM spans over.
 - frequency_skip: (optional) int, default None. The amount the LSTM filter is shifted by in frequency.
 - num_frequency_blocks: [required] A list of frequency blocks needed to cover the whole input feature splitting defined by start_freqindex_list and end_freqindex_list.
 - start_freqindex_list: [optional] list of ints, default None. The starting frequency index for each frequency block.
 - end_freqindex_list: [optional] list of ints, default None. The ending frequency index for each frequency block.
 - couple_input_forget_gates: (optional) bool, default False. Whether to couple the input and forget gates, i.e. f_gate = 1.0 - i_gate, to reduce model parameters and computation cost.
 - backward_slice_offset: (optional) int32, default 0. The starting offset to slice the feature for backward processing.
 - reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

call(inputs, state)[source]¶
Run one step of LSTM.
Args:
 - inputs: input Tensor, 2D, [batch, num_units].
 - state: tuple of Tensors, 2D, [batch, state_size].
Returns:
A tuple containing:
 - A 2D, [batch, output_dim], Tensor representing the output of the LSTM after reading "inputs" when previous state was "state". Here output_dim is num_units.
 - A 2D, [batch, state_size], Tensor representing the new state of LSTM after reading "inputs" when previous state was "state".
Raises:
 - ValueError: if an input_size was specified and the provided inputs have a different dimension.
BlocksparseLSTMCell¶

class returnn.tf.layers.rec.BlocksparseLSTMCell(*args, **kwargs)[source]¶
Standard LSTM but uses OpenAI blocksparse kernels to support bigger matrices.
Refs:
It uses our own wrapper, see TFNativeOp.init_blocksparse().
BlocksparseMultiplicativeMultistepLSTMCell¶
Conv1DLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
Conv1DLSTMCell
(name='conv_1d_lstm_cell', **kwargs)[source]¶ 1D Convolutional LSTM recurrent network cell.
https://arxiv.org/pdf/1506.04214v1.pdf
Construct Conv1DLSTM. See ConvLSTMCell for more details.
Conv2DLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
Conv2DLSTMCell
(name='conv_2d_lstm_cell', **kwargs)[source]¶ 2D Convolutional LSTM recurrent network cell.
https://arxiv.org/pdf/1506.04214v1.pdf
Construct Conv2DLSTM. See ConvLSTMCell for more details.
Conv3DLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
Conv3DLSTMCell
(name='conv_3d_lstm_cell', **kwargs)[source]¶ 3D Convolutional LSTM recurrent network cell.
https://arxiv.org/pdf/1506.04214v1.pdf
Construct Conv3DLSTM. See ConvLSTMCell for more details.
ConvLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
ConvLSTMCell
(conv_ndims, input_shape, output_channels, kernel_shape, use_bias=True, skip_connection=False, forget_bias=1.0, initializers=None, name='conv_lstm_cell')[source]¶ Convolutional LSTM recurrent network cell.
https://arxiv.org/pdf/1506.04214v1.pdf
Construct ConvLSTMCell.
 Args:
 conv_ndims: Convolution dimensionality (1, 2 or 3).
 input_shape: Shape of the input as an int tuple, excluding the batch size.
 output_channels: int, number of output channels of the conv LSTM.
 kernel_shape: Shape of kernel as an int tuple (of size 1, 2 or 3).
 use_bias: (bool) Use bias in convolutions.
 skip_connection: If set to True, concatenate the input to the output of the conv LSTM. Default: False.
 forget_bias: Forget bias.
 initializers: Unused.
 name: Name of the module.
 Raises:
 ValueError: If skip_connection is True and stride is different from 1
 or if input_shape is incompatible with conv_ndims.
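As a small illustration of the skip_connection flag described above, here is a hypothetical helper (not part of TensorFlow) computing the resulting channel count, assuming skip_connection concatenates the input channels onto the conv-LSTM output channels:

```python
# Hypothetical helper (not part of TensorFlow): number of output channels
# of a ConvLSTMCell, assuming skip_connection=True concatenates the input
# onto the conv-LSTM output along the channel axis.
def conv_lstm_out_channels(input_channels, output_channels, skip_connection):
    return output_channels + (input_channels if skip_connection else 0)

print(conv_lstm_out_channels(3, 16, skip_connection=True))   # 19
print(conv_lstm_out_channels(3, 16, skip_connection=False))  # 16
```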
CoupledInputForgetGateLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
CoupledInputForgetGateLSTMCell
(num_units, use_peepholes=False, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=1, num_proj_shards=1, forget_bias=1.0, state_is_tuple=True, activation=<function tanh>, reuse=None, layer_norm=False, norm_gain=1.0, norm_shift=0.0)[source]¶ Long short-term memory unit (LSTM) recurrent network cell.
The default non-peephole implementation is based on:
Felix Gers, Jurgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM.” IET, 850-855, 1999.
The peephole implementation is based on:
Hasim Sak, Andrew Senior, and Francoise Beaufays. “Long short-term memory recurrent neural network architectures for large scale acoustic modeling.” INTERSPEECH, 2014.
The coupling of input and forget gate is based on:
Greff et al. “LSTM: A Search Space Odyssey”
The class uses optional peephole connections, and an optional projection layer. Layer normalization implementation is based on:
“Layer Normalization” Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
and is applied before the internal nonlinearities.
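The input/forget coupling from Greff et al. can be sketched in plain Python (illustrative only, not the TF implementation): the forget gate is not learned separately but tied to the input gate as f = 1 - i, so only one of the two gates carries parameters.

```python
# Sketch of input/forget-gate coupling (Greff et al.), assuming i_gate is
# already the sigmoid-activated input gate of a single unit.
def coupled_cell_update(c_prev, i_gate, candidate):
    f_gate = 1.0 - i_gate  # the coupling: no separate forget-gate parameters
    return f_gate * c_prev + i_gate * candidate

print(coupled_cell_update(2.0, 0.25, 4.0))  # 0.75*2.0 + 0.25*4.0 = 2.5
```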
Initialize the parameters for an LSTM cell.
 Args:
 num_units: int, The number of units in the LSTM cell.
 use_peepholes: bool, set True to enable diagonal/peephole connections.
 initializer: (optional) The initializer to use for the weight and projection matrices.
 num_proj: (optional) int, The output dimensionality for the projection matrices. If None, no projection is performed.
 proj_clip: (optional) A float value. If num_proj > 0 and proj_clip is provided, then the projected values are clipped elementwise to within [-proj_clip, proj_clip].
 num_unit_shards: How to split the weight matrix. If >1, the weight matrix is stored across num_unit_shards.
 num_proj_shards: How to split the projection matrix. If >1, the projection matrix is stored across num_proj_shards.
 forget_bias: Biases of the forget gate are initialized by default to 1 in order to reduce the scale of forgetting at the beginning of the training.
 state_is_tuple: If True, accepted and returned states are 2-tuples of the c_state and m_state. By default (False), they are concatenated along the column axis. This default behavior will soon be deprecated.
 activation: Activation function of the inner states.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 layer_norm: If True, layer normalization will be applied.
 norm_gain: float, The layer normalization gain initial value. If layer_norm has been set to False, this argument will be ignored.
 norm_shift: float, The layer normalization shift initial value. If layer_norm has been set to False, this argument will be ignored.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source]¶ Run one step of LSTM.
 Args:
 inputs: input Tensor, 2-D, batch x num_units.
 state: if state_is_tuple is False, this must be a state Tensor, 2-D, batch x state_size. If state_is_tuple is True, this must be a tuple of state Tensors, both 2-D, with column sizes c_state and m_state.
 Returns:
 A tuple containing:
 - A 2-D, [batch x output_dim], Tensor representing the output of the LSTM after reading inputs when previous state was state. Here output_dim is num_proj if num_proj was set, num_units otherwise.
 - Tensor(s) representing the new state of LSTM after reading inputs when the previous state was state. Same type and shape(s) as state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via
 static shape inference.
GLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
GLSTMCell
(num_units, initializer=None, num_proj=None, number_of_groups=1, forget_bias=1.0, activation=<function tanh>, reuse=None)[source]¶ Group LSTM cell (G-LSTM).
The implementation is based on:
O. Kuchaiev and B. Ginsburg “Factorization Tricks for LSTM Networks”, ICLR 2017 workshop.
In brief, a G-LSTM cell consists of one LSTM sub-cell per group, where each sub-cell operates on an evenly-sized sub-vector of the input and produces an evenly-sized sub-vector of the output. For example, a G-LSTM cell with 128 units and 4 groups consists of 4 LSTM sub-cells with 32 units each. If that G-LSTM cell is fed a 200-dim input, then each sub-cell receives a 50-dim part of the input and produces a 32-dim part of the output.
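The per-group slice sizes described above can be computed as follows (an illustrative helper, not TF code):

```python
# Illustrative helper (not TensorFlow code): per-group input and output
# slice sizes of a G-LSTM cell, as described in the paragraph above.
def glstm_group_sizes(num_units, input_dim, number_of_groups):
    # Both num_units and the input dimension must divide evenly into groups.
    assert num_units % number_of_groups == 0
    assert input_dim % number_of_groups == 0
    sub_input = input_dim // number_of_groups   # input slice per sub-cell
    sub_output = num_units // number_of_groups  # output slice per sub-cell
    return sub_input, sub_output

# 128 units, 4 groups, 200-dim input: each sub-cell sees 50 dims, emits 32.
print(glstm_group_sizes(128, 200, 4))  # (50, 32)
```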
Initialize the parameters of GLSTM cell.
 Args:
num_units: int, The number of units in the GLSTM cell initializer: (optional) The initializer to use for the weight and
projection matrices. num_proj: (optional) int, The output dimensionality for the projection
 matrices. If None, no projection is performed.
 number_of_groups: (optional) int, number of groups to use.
 If number_of_groups is 1, then it should be equivalent to LSTM cell
 forget_bias: Biases of the forget gate are initialized by default to 1
 in order to reduce the scale of forgetting at the beginning of the training.
 activation: Activation function of the inner states.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 Raises:
 ValueError: If num_units or num_proj is not divisible by
 number_of_groups.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source]¶ Run one step of G-LSTM.
 Args:
 inputs: input Tensor, 2-D, [batch x num_inputs]. num_inputs must be statically-known and evenly divisible into groups. The innermost vectors of the inputs are split into evenly-sized sub-vectors and fed into the per-group LSTM sub-cells.
 state: this must be a tuple of state Tensors, both 2-D, with column sizes c_state and m_state.
 Returns:
A tuple containing:
A 2D, [batch x output_dim], Tensor representing the output of the GLSTM after reading inputs when previous state was state. Here output_dim is:
num_proj if num_proj was set, num_units otherwise.
LSTMStateTuple representing the new state of GLSTM cell after reading inputs when the previous state was state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via
 static shape inference, or if the input shape is incompatible with the number of groups.
GridLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
GridLSTMCell
(num_units, use_peepholes=False, share_time_frequency_weights=False, cell_clip=None, initializer=None, num_unit_shards=1, forget_bias=1.0, feature_size=None, frequency_skip=None, num_frequency_blocks=None, start_freqindex_list=None, end_freqindex_list=None, couple_input_forget_gates=False, state_is_tuple=True, reuse=None)[source]¶ Grid Long short-term memory unit (LSTM) recurrent network cell.
 The default is based on:
 Nal Kalchbrenner, Ivo Danihelka and Alex Graves “Grid Long Short-Term Memory,” Proc. ICLR 2016. http://arxiv.org/abs/1507.01526
 When peephole connections are used, the implementation is based on:
 Tara N. Sainath and Bo Li “Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks.” submitted to INTERSPEECH, 2016.
The code uses optional peephole connections, shared_weights and cell clipping.
Initialize the parameters for an LSTM cell.
 Args:
num_units: int, The number of units in the LSTM cell use_peepholes: (optional) bool, default False. Set True to enable
diagonal/peephole connections. share_time_frequency_weights: (optional) bool, default False. Set True to
 enable shared cell weights between time and frequency LSTMs.
 cell_clip: (optional) A float value, default None, if provided the cell
 state is clipped by this value prior to the cell output activation.
 initializer: (optional) The initializer to use for the weight and
 projection matrices, default None.
 num_unit_shards: (optional) int, default 1, How to split the weight
 matrix. If > 1, the weight matrix is stored across num_unit_shards.
 forget_bias: (optional) float, default 1.0, The initial bias of the
 forget gates, used to reduce the scale of forgetting at the beginning of the training.
 feature_size: (optional) int, default None, The size of the input feature
 the LSTM spans over.
 frequency_skip: (optional) int, default None, The amount the LSTM filter
 is shifted by in frequency.
 num_frequency_blocks: [required] A list of frequency blocks needed to
 cover the whole input feature splitting defined by start_freqindex_list and end_freqindex_list.
 start_freqindex_list: [optional], list of ints, default None, The
 starting frequency index for each frequency block.
 end_freqindex_list: [optional], list of ints, default None. The ending
 frequency index for each frequency block.
 couple_input_forget_gates: (optional) bool, default False, Whether to
 couple the input and forget gates, i.e. f_gate = 1.0 - i_gate, to reduce model parameters and computation cost.
 state_is_tuple: If True, accepted and returned states are 2-tuples of
 the c_state and m_state. By default (False), they are concatenated along the column axis. This default behavior will soon be deprecated.
 reuse: (optional) Python boolean describing whether to reuse variables
 in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 Raises:
 ValueError: if the num_frequency_blocks list is not specified

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source]¶ Run one step of LSTM.
 Args:
 inputs: input Tensor, 2-D, [batch, feature_size].
 state: Tensor or tuple of Tensors, 2-D, [batch, state_size], depends on the flag self._state_is_tuple.
 Returns:
 A tuple containing:
 - A 2-D, [batch, output_dim], Tensor representing the output of the LSTM after reading `inputs` when previous state was `state`. Here output_dim is num_units.
 - A 2-D, [batch, state_size], Tensor representing the new state of LSTM after reading `inputs` when previous state was `state`.
 Raises:
 ValueError: if an input_size was specified and the provided inputs have
 a different dimension.
GRUCell¶

class
tensorflow.python.ops.rnn_cell_impl.
GRUCell
(num_units, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None, **kwargs)[source]¶ Gated Recurrent Unit cell (cf.
http://arxiv.org/abs/1406.1078).
Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnGRU for better performance on GPU, or tf.contrib.rnn.GRUBlockCellV2 for better performance on CPU.
 Args:
num_units: int, The number of units in the GRU cell. activation: Nonlinearity to use. Default: tanh. reuse: (optional) Python boolean describing whether to reuse variables in an
existing scope. If not True, and the existing scope already has the given variables, an error is raised. kernel_initializer: (optional) The initializer to use for the weight and
 projection matrices.
bias_initializer: (optional) The initializer to use for the bias. name: String, the name of the layer. Layers with the same name will share
weights, but to avoid mistakes we require reuse=True in such cases. dtype: Default dtype of the layer (default of None means use the type of
 the first input). Required when build is called before call.
 **kwargs: Dict, keyword named properties for common layer attributes, like
 trainable etc when constructing the cell from configs of get_config().
DEPRECATED FUNCTION
Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This class is equivalent to tf.keras.layers.GRUCell, and will be replaced by that in TensorFlow 2.0.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_config
()[source]¶ Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns:
 Python dictionary.
GRUBlockCell¶

class
tensorflow.contrib.rnn.python.ops.gru_ops.
GRUBlockCell
(num_units=None, cell_size=None, reuse=None, name='gru_cell')[source]¶ Block GRU cell implementation.
Deprecated: use GRUBlockCellV2 instead.
The implementation is based on: http://arxiv.org/abs/1406.1078 Computes the GRU cell forward propagation for 1 time step.
This kernel op implements the following mathematical equations:
Biases are initialized with:
 b_ru – constant_initializer(1.0)
 b_c – constant_initializer(0.0)
```
x_h_prev = [x, h_prev]

[r_bar u_bar] = x_h_prev * w_ru + b_ru

r = sigmoid(r_bar)
u = sigmoid(u_bar)

h_prevr = h_prev \circ r

x_h_prevr = [x h_prevr]

c_bar = x_h_prevr * w_c + b_c
c = tanh(c_bar)

h = (1 - u) \circ c + u \circ h_prev
```
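The equations above can be sketched in plain Python for a single unit, with scalar weights standing in for the matrices w_ru and w_c (illustrative only, not the fused TF kernel):

```python
import math

# Plain-Python sketch of one GRU step per the GRUBlockCell equations,
# for a single scalar unit (not the TF block kernel).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w_r, w_u, w_c, b_ru=1.0, b_c=0.0):
    r = sigmoid(w_r * (x + h_prev) + b_ru)       # reset gate
    u = sigmoid(w_u * (x + h_prev) + b_ru)       # update gate
    c = math.tanh(w_c * (x + r * h_prev) + b_c)  # candidate state
    return (1.0 - u) * c + u * h_prev            # new hidden state

# With zero input and zero previous state the candidate is tanh(0) = 0,
# so the new state stays 0.
print(gru_step(0.0, 0.0, 1.0, 1.0, 1.0))  # 0.0
```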
Initialize the Block GRU cell. (deprecated arguments)
Warning: SOME ARGUMENTS ARE DEPRECATED: (cell_size). They will be removed in a future version. Instructions for updating: cell_size is deprecated, use num_units instead
 Args:
 num_units: int, The number of units in the GRU cell.
 cell_size: int, The old (deprecated) name for num_units.
 reuse: (optional) boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 name: String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases. By default this is “gru_cell”, for variable-name compatibility with tf.compat.v1.nn.rnn_cell.GRUCell.
 Raises:
 ValueError: if both cell_size and num_units are not None;
 or both are None.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(input_shape)[source]¶ Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a statecreation step inbetween layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of
 TensorShape if the layer expects a list of inputs (one instance per input).
IndRNNCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
IndRNNCell
(num_units, activation=None, reuse=None, name=None, dtype=None)[source]¶ Independently Recurrent Neural Network (IndRNN) cell (cf. https://arxiv.org/abs/1803.04831).
 Args:
num_units: int, The number of units in the RNN cell. activation: Nonlinearity to use. Default: tanh. reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not True, and the existing scope already has the given variables, an error is raised. name: String, the name of the layer. Layers with the same name will
 share weights, but to avoid mistakes we require reuse=True in such cases.
 dtype: Default dtype of the layer (default of None means use the type
 of the first input). Required when build is called before call.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source]¶ Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a statecreation step inbetween layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of
 TensorShape if the layer expects a list of inputs (one instance per input).
IndyGRUCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
IndyGRUCell
(num_units, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None)[source]¶ Independently Gated Recurrent Unit cell.
Based on IndRNNs (https://arxiv.org/abs/1803.04831) and similar to GRUCell, yet with the \(U_r\), \(U_z\), and \(U\) matrices in equations 5, 6, and 8 of http://arxiv.org/abs/1406.1078 respectively replaced by diagonal matrices, i.e. a Hadamard product with a single vector:
$$r_j = \sigma\left([\mathbf{W}_r \mathbf{x}]_j + [\mathbf{u}_r \circ \mathbf{h}_{(t-1)}]_j\right)$$
$$z_j = \sigma\left([\mathbf{W}_z \mathbf{x}]_j + [\mathbf{u}_z \circ \mathbf{h}_{(t-1)}]_j\right)$$
$$\tilde{h}^{(t)}_j = \phi\left([\mathbf{W} \mathbf{x}]_j + [\mathbf{u} \circ \mathbf{r} \circ \mathbf{h}_{(t-1)}]_j\right)$$
where \(\circ\) denotes the Hadamard operator. This means that each IndyGRU node sees only its own state, as opposed to seeing all states in the same layer.
 Args:
num_units: int, The number of units in the GRU cell. activation: Nonlinearity to use. Default: tanh. reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not True, and the existing scope already has the given variables, an error is raised. kernel_initializer: (optional) The initializer to use for the weight
 matrices applied to the input.
bias_initializer: (optional) The initializer to use for the bias. name: String, the name of the layer. Layers with the same name will
share weights, but to avoid mistakes we require reuse=True in such cases. dtype: Default dtype of the layer (default of None means use the type
 of the first input). Required when build is called before call.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source]¶ Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a statecreation step inbetween layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of
 TensorShape if the layer expects a list of inputs (one instance per input).
IndyLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
IndyLSTMCell
(num_units, forget_bias=1.0, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None)[source]¶ Basic IndyLSTM recurrent network cell.
Based on IndRNNs (https://arxiv.org/abs/1803.04831) and similar to BasicLSTMCell, yet with the \(U_f\), \(U_i\), \(U_o\) and \(U_c\) matrices in the regular LSTM equations replaced by diagonal matrices, i.e. a Hadamard product with a single vector:
$$f_t = \sigma_g\left(W_f x_t + u_f \circ h_{t-1} + b_f\right)$$
$$i_t = \sigma_g\left(W_i x_t + u_i \circ h_{t-1} + b_i\right)$$
$$o_t = \sigma_g\left(W_o x_t + u_o \circ h_{t-1} + b_o\right)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c\left(W_c x_t + u_c \circ h_{t-1} + b_c\right)$$
where \(\circ\) denotes the Hadamard operator. This means that each IndyLSTM node sees only its own state \(h\) and \(c\), as opposed to seeing all states in the same layer.
We add forget_bias (default: 1) to the biases of the forget gate in order to reduce the scale of forgetting in the beginning of the training.
It does not allow cell clipping, a projection layer, and does not use peephole connections: it is the basic baseline.
For a detailed analysis of IndyLSTMs, see https://arxiv.org/abs/1903.08023.
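One IndyLSTM step per the equations above can be sketched in plain Python for a single unit, with scalar weights (illustrative only, not the TF implementation); the recurrent weights u_* multiply h element-wise, so each unit only ever reads its own previous h and c:

```python
import math

# Plain-Python sketch of one IndyLSTM step for a single unit.
# W, u, b map each gate name ("f", "i", "o", "c") to a scalar parameter.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def indy_lstm_step(x, h_prev, c_prev, W, u, b, forget_bias=1.0):
    f = sigmoid(W["f"] * x + u["f"] * h_prev + b["f"] + forget_bias)
    i = sigmoid(W["i"] * x + u["i"] * h_prev + b["i"])
    o = sigmoid(W["o"] * x + u["o"] * h_prev + b["o"])
    c = f * c_prev + i * math.tanh(W["c"] * x + u["c"] * h_prev + b["c"])
    return o * math.tanh(c), c

# With all-zero input, state and parameters, both outputs stay 0.
zeros = {k: 0.0 for k in "fioc"}
h, c = indy_lstm_step(0.0, 0.0, 0.0, zeros, zeros, zeros)
print(h, c)  # 0.0 0.0
```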
Initialize the IndyLSTM cell.
 Args:
num_units: int, The number of units in the LSTM cell.
forget_bias: float, The bias added to forget gates (see above). Must be set to 0.0 manually when restoring from CudnnLSTM-trained checkpoints.
activation: Activation function of the inner states. Default: tanh.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not True, and the existing scope already has the given variables, an error is raised. kernel_initializer: (optional) The initializer to use for the weight
 matrix applied to the inputs.
bias_initializer: (optional) The initializer to use for the bias. name: String, the name of the layer. Layers with the same name will
share weights, but to avoid mistakes we require reuse=True in such cases. dtype: Default dtype of the layer (default of None means use the type
 of the first input). Required when build is called before call.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source]¶ Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a statecreation step inbetween layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of
 TensorShape if the layer expects a list of inputs (one instance per input).

call
(inputs, state)[source]¶ Independent Long short-term memory cell (IndyLSTM).
 Args:
inputs: 2D tensor with shape [batch_size, input_size]. state: An LSTMStateTuple of state tensors, each shaped
[batch_size, num_units]. Returns:
 A pair containing the new hidden state, and the new state (a
 LSTMStateTuple).
IntersectionRNNCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
IntersectionRNNCell
(num_units, num_in_proj=None, initializer=None, forget_bias=1.0, y_activation=<function relu>, reuse=None)[source]¶ Intersection Recurrent Neural Network (+RNN) cell.
Architecture with coupled recurrent gate as well as coupled depth gate, designed to improve information flow through stacked RNNs. As the architecture uses depth gating, the dimensionality of the depth output (y) also should not change through depth (input size == output size). To achieve this, the first layer of a stacked Intersection RNN projects the inputs to N (num units) dimensions. Therefore when initializing an IntersectionRNNCell, one should set num_in_proj = N for the first layer and use default settings for subsequent layers.
This implements the recurrent cell from the paper:
Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. “Capacity and Trainability in Recurrent Neural Networks” Proc. ICLR 2017.
The Intersection RNN is built for use in deeply stacked RNNs so it may not achieve best performance with depth 1.
Initialize the parameters for an +RNN cell.
 Args:
 num_units: int, The number of units in the +RNN cell.
 num_in_proj: (optional) int, The input dimensionality for the +RNN. If creating the first layer of an +RNN, this should be set to num_units. Otherwise, this should be set to None (default). If None, dimensionality of inputs should be equal to num_units, otherwise ValueError is thrown.
 initializer: (optional) The initializer to use for the weight matrices.
 forget_bias: (optional) float, default 1.0, The initial bias of the forget gates, used to reduce the scale of forgetting at the beginning of the training.
 y_activation: (optional) Activation function of the states passed through depth. Default is `tf.nn.relu`.
 reuse: (optional) Python boolean describing whether to reuse variables
 in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

state_size
[source]¶ size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source]¶ Run one step of the Intersection RNN.
 Args:
 inputs: input Tensor, 2D, batch x input size. state: state Tensor, 2D, batch x num units.
 Returns:
 new_y: batch x num units, Tensor representing the output of the +RNN
 after reading inputs when previous state was state.
 new_state: batch x num units, Tensor representing the state of the +RNN
 after reading inputs when previous state was state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via
 static shape inference.
 ValueError: If input size != output size (these must be equal when
 using the Intersection RNN).
LayerNormBasicLSTMCell¶

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
LayerNormBasicLSTMCell
(num_units, forget_bias=1.0, input_size=None, activation=<function tanh>, layer_norm=True, norm_gain=1.0, norm_shift=0.0, dropout_keep_prob=1.0, dropout_prob_seed=None, reuse=None)[source]¶ LSTM unit with layer normalization and recurrent dropout.
This class adds layer normalization and recurrent dropout to a basic LSTM unit. Layer normalization implementation is based on:
“Layer Normalization” Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
and is applied before the internal nonlinearities. Recurrent dropout is based on:
“Recurrent Dropout without Memory Loss” Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth.
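The layer normalization applied here (Ba et al.) can be sketched as follows: normalize across the feature dimension, then scale and shift with the learned norm_gain and norm_shift parameters, before the nonlinearity (a plain-Python illustration, not the TF code):

```python
import math

# Sketch of layer normalization: zero-mean, unit-variance across the
# feature dimension, then learned gain and shift (norm_gain, norm_shift).
def layer_norm(x, gain=1.0, shift=0.0, eps=1e-12):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) + shift for v in x]

out = layer_norm([1.0, 2.0, 3.0])
# The normalized activations sum to zero (zero mean).
print(sum(out))  # 0.0
```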
Initializes the basic LSTM cell.
 Args:
 num_units: int, The number of units in the LSTM cell.
 forget_bias: float, The bias added to forget gates (see above).
 input_size: Deprecated and unused.
 activation: Activation function of the inner states.
 layer_norm: If True, layer normalization will be applied.
 norm_gain: float, The layer normalization gain initial value. If layer_norm has been set to False, this argument will be ignored.
 norm_shift: float, The layer normalization shift initial value. If layer_norm has been set to False, this argument will be ignored.
 dropout_keep_prob: unit Tensor or float between 0 and 1 representing the recurrent dropout probability value. If float and 1.0, no dropout will be applied.
 dropout_prob_seed: (optional) integer, the randomness seed.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
LayerNormVariantsLSTMCell¶

class
returnn.tf.layers.rec.
LayerNormVariantsLSTMCell
(num_units, norm_gain=1.0, norm_shift=0.0, forget_bias=0.0, activation=<function tanh>, is_training=None, dropout=0.0, dropout_h=0.0, dropout_seed=None, with_concat=False, global_norm=True, global_norm_joined=False, per_gate_norm=False, cell_norm=True, cell_norm_in_output=True, hidden_norm=False, variance_epsilon=1e-12)[source]¶ LSTM unit with layer normalization and recurrent dropout.
This LSTM cell can apply different variants of layer normalization:
1. Layer normalization as in the original paper. Ref: https://arxiv.org/abs/1607.06450. This can be applied by having all default params (global_norm=True, cell_norm=True, cell_norm_in_output=True).
2. Layer normalization for RNMT+. Ref: https://arxiv.org/abs/1804.09849. This can be applied by having all default params except global_norm=False, per_gate_norm=True, cell_norm_in_output=False.
3. TF official LayerNormBasicLSTMCell. Ref: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell. This can be reproduced by having all default params except global_norm=False, per_gate_norm=True.
4. Sockeye LSTM layer normalization implementations. Ref: https://github.com/awslabs/sockeye/blob/master/sockeye/rnn.py
 - LayerNormLSTMCell can be reproduced by having all default params except with_concat=False (just efficiency, no difference in the model).
 - LayerNormPerGateLSTMCell can be reproduced by having all default params except (with_concat=False), global_norm=False, per_gate_norm=True.
 Recurrent dropout is based on:
 https://arxiv.org/abs/1603.05118
Prohibited LN combinations:
 - global_norm and global_norm_joined both enabled
 - per_gate_norm with global_norm or global_norm_joined
Parameters:
 num_units (int) – number of lstm units
 norm_gain (float) – layer normalization gain value
 norm_shift (float) – layer normalization shift (bias) value
 forget_bias (float) – the bias added to forget gates
 activation – Activation function to be applied in the lstm cell
 is_training (bool) – if True then we are in the training phase
 dropout (float) – dropout rate, applied on cell-in (j)
 dropout_h (float) – dropout rate, applied on hidden state (h) when it enters the LSTM (variational dropout)
 dropout_seed (int) – used to create random seeds
 with_concat (bool) – if True then the input and prev hidden state is concatenated for the computation. This is just about computation performance.
 global_norm (bool) – if True then layer normalization is applied for the forward and recurrent outputs (separately).
 global_norm_joined (bool) – if True, then layer norm is applied on the LSTM input (forward and recurrent output together)
 per_gate_norm (bool) – if True then layer normalization is applied per lstm gate
 cell_norm (bool) – if True then layer normalization is applied to the LSTM new cell output
 cell_norm_in_output (bool) – if True, the normalized cell is also used in the output
 hidden_norm (bool) – if True then layer normalization is applied to the LSTM new hidden state output
LayerRNNCell¶

class
tensorflow.python.ops.rnn_cell_impl.
LayerRNNCell
(trainable=True, name=None, dtype=None, **kwargs)[source]¶ Subclass of RNNCells that act like proper tf.Layer objects.
For backwards compatibility purposes, most RNNCell instances allow their call methods to instantiate variables via tf.compat.v1.get_variable. The underlying variable scope thus keeps track of any variables, and returns cached versions. This is atypical of tf.layer objects, which separate this part of layer building into a build method that is only called once.
Here we provide a subclass for RNNCell objects that act exactly as Layer objects do. They must provide a build method and their call methods do not access Variables via tf.compat.v1.get_variable.
NativeLstmCell¶

class
returnn.tf.native_op.
NativeLstmCell
(**kwargs)[source]¶ Native LSTM.

classmethod
map_layer_inputs_to_op
(z, rec_weights, i, initial_state=None)[source]¶ Just like NativeOp.LstmGenericBase.map_layer_inputs_to_op().
Parameters:
 z (tf.Tensor) – Z: inputs: shape (time,batch,n_hidden*4)
 rec_weights (tf.Tensor) – V_h / W_re: shape (n_hidden,n_hidden*4)
 i (tf.Tensor) – index: shape (time,batch)
 initial_state (tf.Tensor|None) – shape (batch,n_hidden)
Return type: (tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor)
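The parameter shapes above can be made concrete with illustrative sizes (a plain-Python sketch, not RETURNN code; the sizes are hypothetical):

```python
# Shape sketch for map_layer_inputs_to_op with illustrative sizes.
time, batch, n_hidden = 20, 8, 512
z_shape = (time, batch, n_hidden * 4)         # gate pre-activations (4 gates)
rec_weights_shape = (n_hidden, n_hidden * 4)  # V_h / W_re recurrent matrix
index_shape = (time, batch)                   # per-frame sequence mask
initial_state_shape = (batch, n_hidden)       # optional initial cell state
print(z_shape)  # (20, 8, 2048)
```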

LSTMBlockCell¶

class
tensorflow.contrib.rnn.python.ops.lstm_ops.
LSTMBlockCell
(num_units, forget_bias=1.0, cell_clip=None, use_peephole=False, dtype=None, reuse=None, name='lstm_cell')[source]¶ Basic LSTM recurrent network cell.
The implementation is based on: http://arxiv.org/abs/1409.2329.
We add forget_bias (default: 1) to the biases of the forget gate in order to reduce the scale of forgetting in the beginning of the training.
Unlike rnn_cell_impl.LSTMCell, this is a monolithic op and should be much faster. The weight and bias matrices should be compatible as long as the variable scope matches.
Initialize the basic LSTM cell.
 Args:
num_units: int, The number of units in the LSTM cell. forget_bias: float, The bias added to forget gates (see above). cell_clip: An optional float. Defaults to 1 (no clipping). use_peephole: Whether to use peephole connections or not. dtype: the variable dtype of this layer. Default to tf.float32. reuse: (optional) boolean describing whether to reuse variables in an
existing scope. If not True, and the existing scope already has the given variables, an error is raised. name: String, the name of the layer. Layers with the same name will
 share weights, but to avoid mistakes we require reuse=True in such cases. By default this is âlstm_cellâ, for variablename compatibility with tf.compat.v1.nn.rnn_cell.LSTMCell.
When restoring from CudnnLSTMtrained checkpoints, must use CudnnCompatibleLSTMBlockCell instead.

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source] Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
MultiRNNCell

class
tensorflow.python.ops.rnn_cell_impl.
MultiRNNCell
(cells, state_is_tuple=True)[source] RNN cell composed sequentially of multiple simple cells.
Example:
num_units = [128, 64]
cells = [BasicLSTMCell(num_units=n) for n in num_units]
stacked_rnn_cell = MultiRNNCell(cells)
Create an RNN cell composed sequentially of a number of RNNCells. (deprecated)
Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This class is equivalent to tf.keras.layers.StackedRNNCells and will be replaced by that in TensorFlow 2.0.
 Args:
 cells: list of RNNCells that will be composed in this order.
 state_is_tuple: If True, accepted and returned states are n-tuples, where n = len(cells). If False, the states are all concatenated along the column axis. This latter behavior will soon be deprecated.
 Raises:
 ValueError: if cells is empty (not allowed), or at least one of the cells returns a state tuple but the flag state_is_tuple is False.

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

zero_state
(batch_size, dtype)[source] Return zero-filled state tensor(s).
 Args:
 batch_size: int, float, or unit Tensor representing the batch size.
 dtype: the data type to use for the state.
 Returns:
If state_size is an int or TensorShape, then the return value is an N-D tensor of shape [batch_size, state_size] filled with zeros.
If state_size is a nested list or tuple, then the return value is a nested list or tuple (of the same structure) of 2-D tensors with the shapes [batch_size, s] for each s in state_size.
NASCell

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
NASCell
(num_units, num_proj=None, use_bias=False, reuse=None, **kwargs)[source] Neural Architecture Search (NAS) recurrent network cell.
This implements the recurrent cell from the paper:
Barret Zoph and Quoc V. Le. "Neural Architecture Search with Reinforcement Learning." Proc. ICLR 2017.
The class uses an optional projection layer.
Initialize the parameters for a NAS cell.
 Args:
 num_units: int, The number of units in the NAS cell.
 num_proj: (optional) int, The output dimensionality for the projection matrices. If None, no projection is performed.
 use_bias: (optional) bool, If True then use biases within the cell. This is False by default.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 **kwargs: Additional keyword arguments.

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source] Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call
(inputs, state)[source] Run one step of NAS Cell.
 Args:
 inputs: input Tensor, 2-D, batch x num_units.
 state: This must be a tuple of state Tensors, both 2-D, with column sizes c_state and m_state.
 Returns:
 A tuple containing:
 A 2-D, [batch x output_dim], Tensor representing the output of the NAS Cell after reading inputs when previous state was state. Here output_dim is num_proj if num_proj was set, num_units otherwise.
 Tensor(s) representing the new state of NAS Cell after reading inputs when the previous state was state. Same type and shape(s) as state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via static shape inference.
NativeLstmLowMemCell

class
returnn.tf.native_op.
NativeLstmLowMemCell
(**kwargs)[source] Native LSTM, low-memory variant.

map_layer_inputs_to_op
(x, weights, b, i, initial_state=None)[source] Just like NativeOp.LstmGenericBase.map_layer_inputs_to_op().
Parameters:  x (tf.Tensor) – inputs: shape (time,batch,n_input_dim)
 weights (tf.Tensor) – shape (n_input_dim+n_hidden,n_hidden*4)
 b (tf.Tensor) – shape (n_hidden*4,)
 i (tf.Tensor) – index: shape (time,batch)
 initial_state (tf.Tensor|None) – shape (batch,n_hidden)
Return type: tuple[tf.Tensor]

PhasedLSTMCell

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
PhasedLSTMCell
(num_units, use_peepholes=False, leak=0.001, ratio_on=0.1, trainable_ratio_on=True, period_init_min=1.0, period_init_max=1000.0, reuse=None)[source] Phased LSTM recurrent network cell.
https://arxiv.org/pdf/1610.09513v1.pdf
Initialize the Phased LSTM cell.
 Args:
 num_units: int, The number of units in the Phased LSTM cell.
 use_peepholes: bool, set True to enable peephole connections.
 leak: float or scalar float Tensor with value in [0, 1]. Leak applied during training.
 ratio_on: float or scalar float Tensor with value in [0, 1]. Ratio of the period during which the gates are open.
 trainable_ratio_on: bool, whether ratio_on is trainable.
 period_init_min: float or scalar float Tensor. With value > 0. Minimum value of the initialized period. The period values are initialized by drawing from the distribution e^U(log(period_init_min), log(period_init_max)), where U(.,.) is the uniform distribution.
 period_init_max: float or scalar float Tensor. With value > period_init_min. Maximum value of the initialized period.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
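The log-uniform period initialization described above can be sketched in plain Python (init_periods is a hypothetical helper for illustration, not part of the TF API):

```python
import math
import random

def init_periods(num_units, period_init_min=1.0, period_init_max=1000.0, rng=None):
    """Draw one oscillation period per unit from the log-uniform
    distribution e^U(log(period_init_min), log(period_init_max))."""
    rng = rng or random.Random(0)
    lo, hi = math.log(period_init_min), math.log(period_init_max)
    return [math.exp(rng.uniform(lo, hi)) for _ in range(num_units)]
```

Drawing in log space spreads the periods evenly across orders of magnitude, so some units oscillate fast and others slowly.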

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source] Phased LSTM Cell.
 Args:
 inputs: A tuple of 2 Tensors. The first Tensor has shape [batch, 1], and type float32 or float64. It stores the time. The second Tensor has shape [batch, features_size], and type float32. It stores the features.
 state: rnn_cell_impl.LSTMStateTuple, state from previous timestep.
 Returns:
 A tuple containing:
 A Tensor of float32, and shape [batch_size, num_units], representing the output of the cell.
 An rnn_cell_impl.LSTMStateTuple, containing 2 Tensors of float32, shape [batch_size, num_units], representing the new state and the output.
RHNCell

class
returnn.tf.layers.rec.
RHNCell
(num_units, is_training=None, depth=5, dropout=0.0, dropout_seed=None, transform_bias=None, batch_size=None)[source] Recurrent Highway Layer. With optional dropout for the recurrent state (fixed over all frames – some call this variational).
 References:
 https://github.com/julian121266/RecurrentHighwayNetworks/
 https://arxiv.org/abs/1607.03474
Parameters:  num_units (int) –
 is_training (bool|tf.Tensor|None) –
 depth (int) –
 dropout (float) –
 dropout_seed (int) –
 transform_bias (float|None) –
 batch_size (int|tf.Tensor|None) –
RNNCell

class
tensorflow.python.ops.rnn_cell_impl.
RNNCell
(trainable=True, name=None, dtype=None, **kwargs)[source] Abstract object representing an RNN cell.
Every RNNCell must have the properties below and implement call with the signature (output, next_state) = call(input, state). The optional third input argument, scope, is allowed for backwards compatibility purposes, but should be left off for new subclasses.
This definition of cell differs from the definition used in the literature. In the literature, "cell" refers to an object with a single scalar output. This definition refers to a horizontal array of such units.
An RNN cell, in the most abstract setting, is anything that has a state and performs some operation that takes a matrix of inputs. This operation results in an output matrix with self.output_size columns. If self.state_size is an integer, this operation also results in a new state matrix with self.state_size columns. If self.state_size is a (possibly nested tuple of) TensorShape object(s), then it should return a matching structure of Tensors having shape [batch_size].concatenate(s) for each s in self.state_size.

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(_)[source] Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

zero_state
(batch_size, dtype)[source] Return zero-filled state tensor(s).
 Args:
 batch_size: int, float, or unit Tensor representing the batch size.
 dtype: the data type to use for the state.
 Returns:
If state_size is an int or TensorShape, then the return value is an N-D tensor of shape [batch_size, state_size] filled with zeros.
If state_size is a nested list or tuple, then the return value is a nested list or tuple (of the same structure) of 2-D tensors with the shapes [batch_size, s] for each s in state_size.

get_config
()[source] Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns:
 Python dictionary.

SRUCell

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
SRUCell
(num_units, activation=None, reuse=None, name=None, **kwargs)[source] SRU, Simple Recurrent Unit.
Implementation based on Training RNNs as Fast as CNNs (cf. https://arxiv.org/abs/1709.02755).
This variation of RNN cell is characterized by the simplified data dependence between hidden states of two consecutive time steps. Traditionally, the hidden state from a cell at time step t-1 needs to be multiplied with a matrix W_hh before being fed into the ensuing cell at time step t. This flavor of RNN replaces the matrix multiplication between h_{t-1} and W_hh with a point-wise multiplication, resulting in a performance gain.
 Args:
 num_units: int, The number of units in the SRU cell.
 activation: Nonlinearity to use. Default: tanh.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 name: (optional) String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases.
 **kwargs: Additional keyword arguments.
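To make the point-wise data dependence concrete, here is a minimal NumPy sketch of one SRU-style step, following the cited paper. The names and the highway output term are illustrative (and assume the input size equals num_units); this is not RETURNN's or TF's actual implementation:

```python
import numpy as np

def sru_step(x, c_prev, W, Wf, bf, Wr, br):
    """One SRU step: the recurrent state c_prev enters only via
    point-wise multiplication -- there is no h_{t-1} @ W_hh matmul."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    xt = x @ W                        # transformed input (candidate)
    f = sigmoid(x @ Wf + bf)          # forget gate, depends on x only
    r = sigmoid(x @ Wr + br)          # reset gate, depends on x only
    c = f * c_prev + (1.0 - f) * xt   # point-wise recurrence
    h = r * np.tanh(c) + (1.0 - r) * x  # highway-style output
    return h, c
```

Because the gates depend only on x, the three matrix products can be batched over all time steps at once, leaving only the cheap point-wise recurrence sequential.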

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

build
(inputs_shape)[source] Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Arguments:
 input_shape: Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
LSTMCell

class
tensorflow.python.ops.rnn_cell_impl.
LSTMCell
(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=None, num_proj_shards=None, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None, name=None, dtype=None, **kwargs)[source] Long short-term memory unit (LSTM) recurrent network cell.
The default non-peephole implementation is based on:
Felix Gers, Jurgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual prediction with LSTM." IET, 850-855, 1999.
The peephole implementation is based on:
Hasim Sak, Andrew Senior, and Francoise Beaufays. "Long short-term memory recurrent neural network architectures for large scale acoustic modeling." INTERSPEECH, 2014.
The class uses optional peephole connections, optional cell clipping, and an optional projection layer.
Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU, or tf.contrib.rnn.LSTMBlockCell and tf.contrib.rnn.LSTMBlockFusedCell for better performance on CPU.
Initialize the parameters for an LSTM cell. (deprecated)
Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This class is equivalent to tf.keras.layers.LSTMCell and will be replaced by that in TensorFlow 2.0.
 Args:
 num_units: int, The number of units in the LSTM cell.
 use_peepholes: bool, set True to enable diagonal/peephole connections.
 cell_clip: (optional) A float value, if provided the cell state is clipped by this value prior to the cell output activation.
 initializer: (optional) The initializer to use for the weight and projection matrices.
 num_proj: (optional) int, The output dimensionality for the projection matrices. If None, no projection is performed.
 proj_clip: (optional) A float value. If num_proj > 0 and proj_clip is provided, then the projected values are clipped elementwise to within [-proj_clip, proj_clip].
 num_unit_shards: Deprecated, will be removed by Jan. 2017. Use a variable_scope partitioner instead.
 num_proj_shards: Deprecated, will be removed by Jan. 2017. Use a variable_scope partitioner instead.
 forget_bias: Biases of the forget gate are initialized by default to 1 in order to reduce the scale of forgetting at the beginning of the training. Must be set manually to 0.0 when restoring from CudnnLSTM-trained checkpoints.
 state_is_tuple: If True, accepted and returned states are 2-tuples of the c_state and m_state. If False, they are concatenated along the column axis. This latter behavior will soon be deprecated.
 activation: Activation function of the inner states. Default: tanh. It could also be a string that is within Keras activation function names.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
 name: String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases.
 dtype: Default dtype of the layer (default of None means use the type of the first input). Required when build is called before call.
 **kwargs: Dict, keyword named properties for common layer attributes, like trainable etc. when constructing the cell from configs of get_config().
When restoring from CudnnLSTM-trained checkpoints, use CudnnCompatibleLSTMCell instead.
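For orientation, the non-peephole, non-projection variant of the step computed by call can be sketched in NumPy. The i/j/f/o gate order matches the TF cell kernels; this is an illustrative sketch, not the actual implementation:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b, forget_bias=1.0):
    """One step of a basic (non-peephole) LSTM.
    W has shape (input_size + num_units, 4 * num_units)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = np.concatenate([x, h_prev], axis=-1) @ W + b
    i, j, f, o = np.split(z, 4, axis=-1)          # input, candidate, forget, output
    c = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(j)
    h = sigmoid(o) * np.tanh(c)
    return h, c
```

Note how forget_bias is added inside the forget-gate sigmoid, which is why it must be reset to 0.0 when loading CudnnLSTM-trained weights that already include it.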

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source] Run one step of LSTM.
 Args:
 inputs: input Tensor, must be 2-D, [batch, input_size].
 state: if state_is_tuple is False, this must be a state Tensor, 2-D, [batch, state_size]. If state_is_tuple is True, this must be a tuple of state Tensors, both 2-D, with column sizes c_state and m_state.
 Returns:
 A tuple containing:
 A 2-D, [batch, output_dim], Tensor representing the output of the LSTM after reading inputs when previous state was state. Here output_dim is num_proj if num_proj was set, num_units otherwise.
 Tensor(s) representing the new state of LSTM after reading inputs when the previous state was state. Same type and shape(s) as state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via static shape inference.

get_config
()[source] Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns:
 Python dictionary.
TimeFreqLSTMCell

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
TimeFreqLSTMCell
(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_unit_shards=1, forget_bias=1.0, feature_size=None, frequency_skip=1, reuse=None)[source] Time-Frequency Long short-term memory unit (LSTM) recurrent network cell.
This implementation is based on:
Tara N. Sainath and Bo Li. "Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks." Submitted to INTERSPEECH, 2016.
It uses peephole connections and optional cell clipping.
Initialize the parameters for an LSTM cell.
 Args:
 num_units: int, The number of units in the LSTM cell.
 use_peepholes: bool, set True to enable diagonal/peephole connections.
 cell_clip: (optional) A float value, if provided the cell state is clipped by this value prior to the cell output activation.
 initializer: (optional) The initializer to use for the weight and projection matrices.
 num_unit_shards: int, How to split the weight matrix. If >1, the weight matrix is stored across num_unit_shards.
 forget_bias: float, Biases of the forget gate are initialized by default to 1 in order to reduce the scale of forgetting at the beginning of the training.
 feature_size: int, The size of the input feature the LSTM spans over.
 frequency_skip: int, The amount the LSTM filter is shifted by in frequency.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source] Run one step of LSTM.
 Args:
 inputs: input Tensor, 2-D, batch x num_units.
 state: state Tensor, 2-D, batch x state_size.
 Returns:
 A tuple containing:
 A 2-D, batch x output_dim, Tensor representing the output of the LSTM after reading "inputs" when previous state was "state". Here output_dim is num_units.
 A 2-D, batch x state_size, Tensor representing the new state of LSTM after reading "inputs" when previous state was "state".
 Raises:
 ValueError: if an input_size was specified and the provided inputs have a different dimension.
TwoDNativeLstmCell

class
returnn.tf.native_op.
TwoDNativeLstmCell
(pooling, **kwargs)[source] Native 2D LSTM.

classmethod
map_layer_inputs_to_op
(X, V_h, V_v, W, i, previous_state=None, previous_output=None, iteration=None)[source] Just like NativeOp.LstmGenericBase.map_layer_inputs_to_op().
Parameters:  X (tf.Tensor) – inputs: shape (timeT,timeS,batch,n_hidden*5)
 V_h (tf.Tensor) – W_re: shape (n_hidden,n_hidden*5)
 V_v (tf.Tensor) – W_re: shape (n_hidden,n_hidden*5)
 W (tf.Tensor) –
 i (tf.Tensor) – index: shape (time,batch)
 previous_state (tf.Tensor) –
 previous_output (tf.Tensor) –
 iteration (tf.Tensor) –
Return type: (tf.Tensor,tf.Tensor,tf.Tensor,tf.Tensor)
UGRNNCell

class
tensorflow.contrib.rnn.python.ops.rnn_cell.
UGRNNCell
(num_units, initializer=None, forget_bias=1.0, activation=<function tanh>, reuse=None)[source] Update Gate Recurrent Neural Network (UGRNN) cell.
A compromise between an LSTM/GRU and a vanilla RNN. There is only one gate, and that is to determine whether the unit should be integrating or computing instantaneously. This is the recurrent idea of the feedforward Highway Network.
This implements the recurrent cell from the paper:
Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. "Capacity and Trainability in Recurrent Neural Networks." Proc. ICLR 2017.
Initialize the parameters for a UGRNN cell.
 Args:
 num_units: int, The number of units in the UGRNN cell.
 initializer: (optional) The initializer to use for the weight matrices.
 forget_bias: (optional) float, default 1.0, The initial bias of the forget gate, used to reduce the scale of forgetting at the beginning of the training.
 activation: (optional) Activation function of the inner states. Default is tf.tanh.
 reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
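The single-gate idea can be sketched in a few lines of NumPy (names are illustrative, following the cited paper, not the TF implementation):

```python
import numpy as np

def ugrnn_step(x, h_prev, Wg, Wc, bg, bc, forget_bias=1.0):
    """One UGRNN step: a single update gate g interpolates between the
    previous state (integrating) and an instantaneous candidate."""
    xh = np.concatenate([x, h_prev], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(xh @ Wg + bg + forget_bias)))  # update gate
    c = np.tanh(xh @ Wc + bc)                                # candidate
    return g * h_prev + (1.0 - g) * c                        # new state == output
```

With forget_bias > 0 the gate initially leans towards keeping the previous state, mirroring the forget-gate bias trick in LSTMs.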

state_size
[source] size(s) of state(s) used by this cell.
It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

call
(inputs, state)[source] Run one step of UGRNN.
 Args:
 inputs: input Tensor, 2-D, batch x input size.
 state: state Tensor, 2-D, batch x num units.
 Returns:
 new_output: batch x num units, Tensor representing the output of the UGRNN after reading inputs when previous state was state. Identical to new_state.
 new_state: batch x num units, Tensor representing the state of the UGRNN after reading inputs when previous state was state.
 Raises:
 ValueError: If input size cannot be inferred from inputs via static shape inference.
ZoneoutLSTMCell

class
returnn.tf.layers.rec.
ZoneoutLSTMCell
(num_units, zoneout_factor_cell=0.0, zoneout_factor_output=0.0)[source] Wrapper for the TF LSTM cell to create a Zoneout LSTM cell. This code is an adapted version of Rayhane Mama's Tacotron-2 implementation.
Initializer with the possibility to set different zoneout values for cell and hidden states.
Parameters:  num_units (int) – number of hidden units
 zoneout_factor_cell (float) – cell zoneout factor
 zoneout_factor_output (float) – output zoneout factor
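The zoneout state update applied by such a wrapper can be sketched in NumPy (an illustrative sketch of the technique, not RETURNN's implementation):

```python
import numpy as np

def zoneout(prev, candidate, factor, training, rng=None):
    """Zoneout state update: during training, each unit keeps its previous
    value with probability `factor`; at inference the expectation is used."""
    if training:
        rng = rng or np.random.default_rng(0)
        keep = rng.random(prev.shape) < factor          # Bernoulli(factor) mask
        return np.where(keep, prev, candidate)
    return factor * prev + (1.0 - factor) * candidate   # expected value
```

The same rule is applied separately to the cell state (zoneout_factor_cell) and the hidden output (zoneout_factor_output).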
Optimizer
This is a list of all optimizers that can be used with RETURNN. If you are looking for how to set the optimizer correctly in the RETURNN config, please have a look at the optimizer settings.
Adadelta

class
tensorflow.python.training.adadelta.
AdadeltaOptimizer
(learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')[source] Optimizer that implements the Adadelta algorithm.
See [M. D. Zeiler](http://arxiv.org/abs/1212.5701) ([pdf](http://arxiv.org/pdf/1212.5701v1.pdf)).
Construct a new Adadelta optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning rate. To match the exact form in the original paper use 1.0.
 rho: A Tensor or a floating point value. The decay rate.
 epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the grad update.
 use_locking: If True use locks for update operations.
 name: Optional name prefix for the operations created when applying gradients. Defaults to "Adadelta".
@compatibility(eager) When eager execution is enabled, learning_rate, rho, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
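One Adadelta update, following Zeiler's paper, can be sketched in NumPy (an illustrative sketch, not the TF kernel):

```python
import numpy as np

def adadelta_step(var, g, accum, delta_accum, lr=1.0, rho=0.95, eps=1e-8):
    """Scale the gradient by the ratio of the RMS of past updates to the
    RMS of past gradients, so steps self-adapt without a tuned learning rate."""
    accum = rho * accum + (1.0 - rho) * g * g                      # E[g^2]
    update = np.sqrt(delta_accum + eps) / np.sqrt(accum + eps) * g
    delta_accum = rho * delta_accum + (1.0 - rho) * update * update  # E[dx^2]
    return var - lr * update, accum, delta_accum
```

With lr=1.0 this matches the paper's form; the TF default of 0.001 additionally scales each step down.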
Adagrad

class
tensorflow.python.training.adagrad.
AdagradOptimizer
(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')[source] Optimizer that implements the Adagrad algorithm.
See this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) or this [intro](https://ppasupat.github.io/a9online/uploads/proximal_notes.pdf).
Construct a new Adagrad optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning rate.
 initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
 use_locking: If True use locks for update operations.
 name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad".
 Raises:
 ValueError: If the initial_accumulator_value is invalid.
@compatibility(eager) When eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
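The Adagrad rule itself is two lines; here is a NumPy sketch showing the role of the accumulator (illustrative, not the TF kernel):

```python
import numpy as np

def adagrad_step(var, g, accum, lr=0.1):
    """One Adagrad update: the per-parameter effective learning rate
    shrinks as squared gradients accumulate."""
    accum = accum + g * g
    return var - lr * g / np.sqrt(accum), accum
```

This is why initial_accumulator_value must be positive: it keeps the first division well-defined before any gradients have been accumulated.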
AdagradDA

class
tensorflow.python.training.adagrad_da.
AdagradDAOptimizer
(learning_rate, global_step, initial_gradient_squared_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='AdagradDA')[source] Adagrad Dual Averaging algorithm for sparse linear models.
See this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
This optimizer takes care of regularization of unseen features in a mini-batch by updating them when they are seen with a closed-form update rule that is equivalent to having updated them on every mini-batch.
AdagradDA is typically used when there is a need for large sparsity in the trained model. This optimizer only guarantees sparsity for linear models. Be careful when using AdagradDA for deep networks as it will require careful initialization of the gradient accumulators for it to train.
Construct a new AdagradDA optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning rate.
 global_step: A Tensor containing the current training step number.
 initial_gradient_squared_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
 l1_regularization_strength: A float value, must be greater than or equal to zero.
 l2_regularization_strength: A float value, must be greater than or equal to zero.
 use_locking: If True use locks for update operations.
 name: Optional name prefix for the operations created when applying gradients. Defaults to "AdagradDA".
 Raises:
 ValueError: If the initial_gradient_squared_accumulator_value is invalid.
Adam

class
tensorflow.python.training.adam.
AdamOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source] Optimizer that implements the Adam algorithm.
See [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).
Construct a new Adam optimizer.
Initialization:
$$m_0 := 0 \text{ (Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{ (Initialize initial 2nd moment vector)}$$ $$t := 0 \text{ (Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:
$$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$ $$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$ $$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
 Args:
 learning_rate: A Tensor or a floating point value. The learning rate.
 beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
 beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
 epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
 use_locking: If True use locks for update operations.
 name: Optional name for the operations created when applying gradients. Defaults to "Adam".
@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
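The update rule above translates directly into NumPy (an illustrative sketch of one Adam step, not the TF implementation):

```python
import numpy as np

def adam_step(var, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using the bias-corrected step size lr_t."""
    t += 1
    lr_t = lr * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)  # bias correction
    m = beta1 * m + (1.0 - beta1) * g       # 1st moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g   # 2nd moment (uncentered variance)
    return var - lr_t * m / (np.sqrt(v) + eps), m, v, t
```

Note that eps is added outside the square root, matching the "epsilon hat" formulation described above.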
AdaMaxÂ¶

class
tensorflow.contrib.opt.python.training.adamax.
AdaMaxOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e08, use_locking=False, name='AdaMax')[source]Â¶ Optimizer that implements the AdaMax algorithm.
Adamax is sometimes superior to adam, specially in models with embeddings, see [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).
Construct a new AdaMax optimizer.
Initialization:
` m_0 < 0 (Initialize initial 1st moment vector) v_0 < 0 (Initialize the exponentially weighted infinity norm) t < 0 (Initialize timestep) `
The update rule for variable with gradient g uses an optimization described at the end of section 7.1 of the paper:
``` t < t + 1
m_t < beta1 * m_{t1} + (1  beta1) * g v_t < max(beta2 * v_{t1}, abs(g)) variable < variable  learning_rate / (1  beta1^t) * m_t / (v_t + epsilon) ```
Similar to AdamOptimizer, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t = 0).
Contrast to AdamOptimizer, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) only updates variable slices and corresponding m_t, v_t terms when that part of the variable was used in the forward pass. This means that the sparse behavior is contrast to the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used).
 Args:
learning_rate: A Tensor or a floating point value. The learning rate. beta1: A float value or a constant float tensor.
The exponential decay rate for the 1st moment estimates. beta2: A float value or a constant float tensor.
 The exponential decay rate for the exponentially weighted infinity norm.
epsilon: A small constant for numerical stability. use_locking: If True use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "AdaMax".
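The AdaMax update rule above can be transcribed into a few lines of NumPy. This is an illustrative sketch, not part of the TensorFlow API; the function name and demo values are assumptions:

```python
import numpy as np

def adamax_step(var, g, m, v, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One AdaMax update, transcribing the rule above (illustrative sketch)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = np.maximum(beta2 * v, np.abs(g))     # exponentially weighted inf-norm
    var = var - learning_rate / (1 - beta1 ** t) * m / (v + epsilon)
    return var, m, v, t

# One step from freshly initialized accumulators:
var, m, v, t = adamax_step(np.array([1.0, -2.0]), np.array([0.5, -0.5]),
                           np.zeros(2), np.zeros(2), t=0)
```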
AdamGS

class
tensorflow.contrib.opt.python.training.adam_gs_optimizer.
AdamGSOptimizer
(global_step=0, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]
Optimizer that implements the Adam algorithm.
See [Kingma et al., 2014](http://arxiv.org/abs/1412.6980) ([pdf](http://arxiv.org/pdf/1412.6980.pdf)).
Construct a new Adam optimizer.
Branched from tf.train.AdamOptimizer. The only difference is to pass global step for computing beta1 and beta2 accumulators, instead of having the optimizer keep its own independent beta1 and beta2 accumulators as non-slot variables.
Initialization:
$$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:
$$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$ $$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$ $$variable := variable - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
 Args:
global_step: tensorflow variable indicating the step. learning_rate: A Tensor or a floating point value. The learning rate. beta1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates. beta2: A float value or a constant float tensor. The exponential decay
 rate for the 2nd moment estimates.
 epsilon: A small constant for numerical stability. This epsilon is
 "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "Adam". @compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
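The Adam update above, driven by an externally supplied global step (which is the one difference AdamGSOptimizer introduces), can be sketched in NumPy; the function name and demo values are illustrative:

```python
import numpy as np

def adam_step(var, g, m, v, global_step, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update where the bias-correction step t comes from an
    external global step, as in AdamGSOptimizer (illustrative sketch)."""
    t = global_step + 1
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    var = var - lr_t * m / (np.sqrt(v) + epsilon)
    return var, m, v, t

# First step from zero accumulators:
var, m, v, t = adam_step(np.array([1.0]), np.array([2.0]),
                         np.zeros(1), np.zeros(1), global_step=0)
```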
AdamW

class
tensorflow.contrib.opt.python.training.weight_decay_optimizers.
AdamWOptimizer
(weight_decay, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='AdamW')[source]
Optimizer that implements the Adam algorithm with weight decay.
This is an implementation of the AdamW optimizer described in ["Fixing Weight Decay Regularization in Adam" by Loshchilov & Hutter] (https://arxiv.org/abs/1711.05101) ([pdf](https://arxiv.org/pdf/1711.05101.pdf)).
It computes the update step of train.AdamOptimizer and additionally decays the variable. Note that this is different from adding L2 regularization on the variables to the loss: it regularizes variables with large gradients more than L2 regularization would, which was shown to yield better training loss and generalization error in the paper above.
For further information see the documentation of the Adam Optimizer.
Note that this optimizer can also be instantiated as
`python extend_with_weight_decay(tf.compat.v1.train.AdamOptimizer, weight_decay=weight_decay) `
Construct a new AdamW optimizer.
For further information see the documentation of the Adam Optimizer.
 Args:
weight_decay: A Tensor or a floating point value. The weight decay. learning_rate: A Tensor or a floating point value. The learning rate. beta1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates. beta2: A float value or a constant float tensor. The exponential decay
 rate for the 2nd moment estimates.
 epsilon: A small constant for numerical stability. This epsilon is
 "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "Adam".
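A minimal sketch of the decoupled decay described above, assuming the decay term is subtracted from the variable directly rather than added to the gradient as plain L2 regularization would be; the function name and demo values are illustrative:

```python
import numpy as np

def adamw_step(var, g, m, v, t, weight_decay=0.01, learning_rate=0.001,
               beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam step plus decoupled weight decay (illustrative sketch)."""
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Decoupled decay: shrink the variable itself, independent of the
    # adaptive gradient step (the point of AdamW).
    var = var - lr_t * m / (np.sqrt(v) + epsilon) - weight_decay * var
    return var, m, v, t

# With a zero gradient only the decay acts, shrinking the weights:
var, m, v, t = adamw_step(np.array([1.0]), np.zeros(1),
                          np.zeros(1), np.zeros(1), t=0)
```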
AddSign

class
tensorflow.contrib.opt.python.training.addsign.
AddSignOptimizer
(learning_rate=0.1, alpha=1.0, beta=0.9, sign_decay_fn=None, use_locking=False, name='AddSignOptimizer')[source]
Optimizer that implements the AddSign update.
See [Bello et al., ICML 2017], [Neural Optimizer Search with RL](https://arxiv.org/abs/1709.07417).
Constructs a new AddSignOptimizer object.
Initialization:
` m_0 <- 0 (Initialize initial 1st moment vector) t <- 0 (Initialize timestep) `
Update:
` t <- t + 1 m_t <- beta1 * m_{t-1} + (1 - beta1) * g sign_decay <- sign_decay_fn(t) update <- (alpha + sign_decay * sign(g) * sign(m)) * g variable <- variable - lr_t * update `
Example for AddSignld (AddSign with linear sign decay)
` decay_steps = 1000 linear_decay_fn = sign_decays.get_linear_decay_fn(decay_steps) opt = AddSignOptimizer(learning_rate=0.1, sign_decay_fn=linear_decay_fn) `
 Args:
learning_rate: learning_rate used when taking a step. alpha: alpha used in optimizer. beta: decay used for computing the moving average m. sign_decay_fn: decay function applied to the sign(g) sign(m) quantity.
Takes global_step as an argument. See sign_decay.py for some examples. use_locking: If True, use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "AddSignOptimizer".
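The AddSign update rule above can be sketched in NumPy; `addsign_step` and the demo values are illustrative, not part of the TensorFlow API:

```python
import numpy as np

def addsign_step(var, g, m, t, learning_rate=0.1, alpha=1.0, beta=0.9,
                 sign_decay_fn=None):
    """One AddSign update, transcribing the rule above (illustrative sketch)."""
    t += 1
    m = beta * m + (1 - beta) * g
    sign_decay = sign_decay_fn(t) if sign_decay_fn is not None else 1.0
    # The step is amplified when gradient and moving average agree in sign.
    update = (alpha + sign_decay * np.sign(g) * np.sign(m)) * g
    var = var - learning_rate * update
    return var, m, t

# Gradient and moving average agree in sign, so the step is doubled:
var, m, t = addsign_step(np.array([1.0]), np.array([0.5]), np.zeros(1), t=0)
```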

apply_gradients
(grads_and_vars, global_step=None, name=None)[source] Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the
 variables have been updated.
 name: Optional name for the returned operation. Default to the
 name passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
 Raises:
 TypeError: If grads_and_vars is malformed. ValueError: If none of the variables have gradients. RuntimeError: If you should use _distributed_apply() instead.
AGN

class
tensorflow.contrib.opt.python.training.agn_optimizer.
AGNOptimizer
(optimizer, num_worker, custom_getter, communication_period=10, use_locking=True, name='AGNOptimizer')[source]
Wrapper that implements the Accumulated Gradient Normalization algorithm.
 Reference:
 Accumulated Gradient Normalization: Joeri Hermans, ACML 2017, https://arxiv.org/abs/1710.02368
Construct a new AGN optimizer.
 Args:
optimizer: input optimizer, can be sgd/momentum/adam etc. num_worker: The number of workers. custom_getter: The AGNCustomGetter. communication_period: An integer value that controls the frequency of
communication between every worker and the ps. use_locking: If True use locks for update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to "AGNOptimizer".

apply_gradients
(grads_and_vars, global_step=None, name=None)[source] Apply gradients to global variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the variables
 have been updated.
 name: Optional name for the returned operation. Default to the name
 passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
AMSGrad

class
returnn.tf.updater.
AMSGradOptimizer
(learning_rate=0.001, decay=False, beta1=0.9, beta2=0.99, epsilon=0.0, var_list=())[source]
https://colab.research.google.com/notebook#fileId=1xXFAuHM2AeOmF5M8Cn9ypGCa_HHBgfG&scrollTo=N12wPHN1Otn https://openreview.net/pdf?id=ryQu7f-RZ https://keras.io/optimizers/ http://ruder.io/deep-learning-optimization-2017/index.html#fixingtheexponentialmovingaverage https://github.com/taki0112/AMSGrad-Tensorflow
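As a rough guide to what the linked references describe, here is a sketch of one common AMSGrad formulation (without bias correction): like Adam, but dividing by the running maximum of the second-moment estimate. The function name, the epsilon default, and other details are assumptions and may differ from the actual returnn.tf.updater implementation:

```python
import numpy as np

def amsgrad_step(var, g, m, v, v_hat, t, learning_rate=0.001,
                 beta1=0.9, beta2=0.99, epsilon=1e-8):
    """One AMSGrad update (illustrative sketch, no bias correction)."""
    t += 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)  # non-decreasing second-moment estimate
    var = var - learning_rate * m / (np.sqrt(v_hat) + epsilon)
    return var, m, v, v_hat, t

var, m, v, v_hat, t = amsgrad_step(np.array([1.0]), np.array([1.0]),
                                   np.zeros(1), np.zeros(1), np.zeros(1), t=0)
```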
BaseCustom

class
returnn.tf.updater.
BaseCustomOptimizer
(learning_rate, use_locking=False, name=None)[source]
Base class for our own optimizer implementations. This simplifies the interface to be implemented a bit, compared to Optimizer. You just have to implement _apply() here. See CustomGradientDescentOptimizer or CustomAdamOptimizer for an example.
Construct a new optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning
 rate to use.
use_locking: If True use locks for update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
CustomAdam

class
returnn.tf.updater.
CustomAdamOptimizer
(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]
Reimplementation of Adam. See also
tf.compat.v1.train.AdamOptimizer
.
``` t <- t + 1 lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon) ```
Parameters: beta1 (float) – used for the running average of g (m)
 beta2 (float) – used for the running average of g*g (v)
 epsilon (float) –
CustomGradientDescent

class
returnn.tf.updater.
CustomGradientDescentOptimizer
(learning_rate, use_locking=False, name=None)[source]
Just an example implementation for simple gradient descent.
Construct a new optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning
 rate to use.
use_locking: If True use locks for update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to self.__class__.__name__.
DropStaleGradient

class
tensorflow.contrib.opt.python.training.drop_stale_gradient_optimizer.
DropStaleGradientOptimizer
(opt, staleness, use_locking=False, name='DropStaleGradient')[source]
Wrapper optimizer that checks and drops stale gradient.
This optimizer records the global step for each worker before computing gradients and compares it with the global step at the time of applying the gradients. If the difference is larger than a threshold, it will drop all the computed gradients.
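The core check described above can be sketched as a plain function; `should_apply_gradient` is an illustrative name, not part of the optimizer's API:

```python
def should_apply_gradient(step_at_compute, step_at_apply, staleness):
    """A gradient is kept only if it is at most `staleness` steps old;
    otherwise DropStaleGradientOptimizer drops it (illustrative sketch)."""
    return (step_at_apply - step_at_compute) <= staleness
```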
Constructs a new DropStaleGradientOptimizer.
 Args:
 opt: The actual optimizer that will be used to compute and apply the
 gradients. Must be one of the Optimizer classes.
staleness: The maximum staleness allowed for the optimizer. use_locking: If True use locks for clip update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to "DropStaleGradient".

compute_gradients
(loss, *args, **kwargs)[source] Compute gradients of loss for the variables in var_list.
This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
 Args:
 loss: A Tensor containing the value to minimize or a callable taking
 no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
 var_list: Optional list or tuple of tf.Variable to update to minimize
 loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
 gate_gradients: How to gate the computation of gradients. Can be
 GATE_NONE, GATE_OP, or GATE_GRAPH.
 aggregation_method: Specifies the method used to combine gradient terms.
 Valid values are defined in the class AggregationMethod.
 colocate_gradients_with_ops: If True, try colocating gradients with
 the corresponding op.
grad_loss: Optional. A Tensor holding the gradient computed for loss.
 Returns:
 A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
 Raises:
TypeError: If var_list contains anything else than Variable objects. ValueError: If some arguments are invalid. RuntimeError: If called with eager execution enabled and loss is
not callable.
@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

get_slot
(*args, **kwargs)[source] Return a slot named name created for var by the Optimizer.
Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.
Use get_slot_names() to get the list of slot names created by the Optimizer.
 Args:
 var: A variable passed to minimize() or apply_gradients(). name: A string.
 Returns:
 The Variable for the slot if it was created, None otherwise.

get_slot_names
(*args, **kwargs)[source] Return a list of the names of slots created by the Optimizer.
See get_slot().
 Returns:
 A list of strings.

apply_gradients
(grads_and_vars, global_step=None, name=None)[source] Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the
 variables have been updated.
 name: Optional name for the returned operation. Default to the
 name passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
 Raises:
 TypeError: If grads_and_vars is malformed. ValueError: If none of the variables have gradients. RuntimeError: If you should use _distributed_apply() instead.
ElasticAverage

class
tensorflow.contrib.opt.python.training.elastic_average_optimizer.
ElasticAverageOptimizer
(opt, num_worker, ea_custom_getter, communication_period=10, moving_rate=None, rho=None, use_locking=True, synchronous=False, name='ElasticAverageOptimizer')[source]
Wrapper optimizer that implements the Elastic Average SGD algorithm.
This is an async optimizer. During training, each worker updates the local variables and maintains its own local_step, which starts from 0 and is incremented by 1 after each update of local variables. Whenever the communication period divides the local step, the worker requests the current global center variables, computes the elastic difference between the global center variables and the local variables, and uses that difference to update both the local variables and the global variables.
Construct a new gradient descent optimizer.
 Args:
 opt: The actual optimizer that will be used to update local variables.
 Must be one of the Optimizer classes.
num_worker: The number of workers. ea_custom_getter: The ElasticAverageCustomGetter. communication_period: An integer value that controls the frequency of
communication between every worker and the ps. moving_rate: A floating point value to control the elastic difference. rho: the amount of exploration we allow in the model. The default value is
moving_rate/learning_rate; rho=0.0 is suggested in async mode. use_locking: If True use locks for update operations. synchronous: Add _sync_queues_and_barrier or not.
True: all workers will wait for each other before starting training. False: a worker can start training as soon as its initialization is done,
with no need to wait until everyone is ready, so in case one worker is restarted, it can join and continue training without being blocked. name: Optional name prefix for the operations created when applying
 gradients. Defaults to "ElasticAverageOptimizer".

compute_gradients
(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)[source] Compute gradients of loss for the variables in var_list.
Add rho*elastic_difference to loss to control the exploration. This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
 Args:
loss: A Tensor containing the value to minimize. var_list: Optional list or tuple of tf.Variable to update to minimize
loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES. gate_gradients: How to gate the computation of gradients. Can be
 GATE_NONE, GATE_OP, or GATE_GRAPH.
 aggregation_method: Specifies the method used to combine gradient terms.
 Valid values are defined in the class AggregationMethod.
 colocate_gradients_with_ops: If True, try colocating gradients with the
 corresponding op.
grad_loss: Optional. A Tensor holding the gradient computed for loss.
 Returns:
 A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
 Raises:
 TypeError: If var_list contains anything else than Variable objects. ValueError: If some arguments are invalid.

apply_gradients
(grads_and_vars, global_step=None, name=None)[source] Apply gradients to global variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the variables
 have been updated.
 name: Optional name for the returned operation. Default to the name
 passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
 Raises:
 TypeError: If grads_and_vars is malformed. ValueError: If none of the variables have gradients.

get_init_op
(task_index)[source] Returns the op to let all the local variables and local center
variables equal the global center variables before the training begins.

make_session_run_hook
(is_chief, task_index)[source] Creates a hook to handle ElasticAverageOptimizerHook ops such as initialization.

swapping_saver
(var_list=None, name='swapping_saver', **kwargs)[source] Create a saver that copies global_center_variable to trainable variables.
Please call this function after all your variables have been created with ElasticAverageCustomGetter. For evaluations or inference, use this saver during training. It will save the global_center_variable of the trained parameters under the original parameter names. Args:
 var_list: List of variables to save, as per Saver(). If set to None,
 save all the trainable_variables that have been created before this call.
name: The name of the saver. **kwargs: Keyword arguments of Saver().
 Returns:
 A tf.compat.v1.train.Saver object.
 Raises:
 RuntimeError: global_center_variable is empty; please make sure
 this is called after the model is created and ElasticAverageCustomGetter is used when declaring your model.
Ftrl

class
tensorflow.python.training.ftrl.
FtrlOptimizer
(learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl', accum_name=None, linear_name=None, l2_shrinkage_regularization_strength=0.0)[source]
Optimizer that implements the FTRL algorithm.
See this [paper]( https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf). This version has support for both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function).
Construct a new FTRL optimizer.
 Args:
learning_rate: A float value or a constant float Tensor. learning_rate_power: A float value, must be less or equal to zero.
Controls how the learning rate decreases during training. Use zero for a fixed learning rate. See section 3.1 in the [paper](https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf). initial_accumulator_value: The starting value for accumulators.
 Only zero or positive values are allowed.
 l1_regularization_strength: A float value, must be greater than or
 equal to zero.
 l2_regularization_strength: A float value, must be greater than or
 equal to zero.
use_locking: If True use locks for update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to "Ftrl". accum_name: The suffix for the variable that keeps the gradient squared
 accumulator. If not present, defaults to name.
 linear_name: The suffix for the variable that keeps the linear gradient
 accumulator. If not present, defaults to name + "_1".
 l2_shrinkage_regularization_strength: A float value, must be greater than
or equal to zero. This differs from L2 above in that the L2 above is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. The FTRL formulation can be written as: w_{t+1} = argmin_w(hat{g}_{1:t}*w + L1*||w||_1 + L2*||w||_2^2), where hat{g} = g + (2*L2_shrinkage*w), and g is the gradient of the loss function w.r.t. the weights w. Specifically, in the absence of L1 regularization, it is equivalent to the following update rule: w_{t+1} = w_t - lr_t / (1 + 2*L2*lr_t) * g_t -
2*L2_shrinkage*lr_t / (1 + 2*L2*lr_t) * w_t, where lr_t is the learning rate at t. When input is sparse, shrinkage will only happen on the active weights.
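The shrinkage update rule quoted above (in the absence of L1 regularization) transcribes directly into a one-line sketch; the function name is illustrative:

```python
def ftrl_shrinkage_update(w, g, lr, l2, l2_shrinkage):
    """FTRL update without L1, with shrinkage-type L2, transcribing the
    rule above (illustrative sketch)."""
    denom = 1.0 + 2.0 * l2 * lr
    return w - lr / denom * g - 2.0 * l2_shrinkage * lr / denom * w
```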
 Raises:
 ValueError: If one of the arguments is invalid.
GGT

class
tensorflow.contrib.opt.python.training.ggt.
GGTOptimizer
(learning_rate=0.001, beta1=0.9, use_locking=False, name='GGT', window=10, eps=0.0001, svd_eps=1e-06, sigma_eps=0.01)[source]
Optimizer that implements the GGT algorithm.
GGT has an advantage over SGD and Adam on large models with poor conditioning, for example language models and CNNs, see [[ABCHSZZ 2018]](https://arxiv.org/pdf/1806.02958.pdf).
Construct a new GGT optimizer.
Initialization:
``` t <- 0 (Initialize timestep) grad_buffer <- 0 (Initialize buffer for keeping past gradients) flat_grad <- 0 (Initialize flattened gradient that contains gradients of all
variables) m_0 <- 0 (Initialize 1st moment vector) ```
Suppose all variables and their gradients are concatenated into vectors flat_vars and flat_grad. The update rule for flat_vars uses an optimization described at the beginning of section 2 of the paper:
``` t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * flat_grad grad_buffer[(t-1) % window, :] <- m_t
M <- grad_buffer^T / sqrt(min(t, window)) U, sigma, _ <- SVD(M^T M + I * svd_eps)
sigma_sqrt_inv <- (sqrt(sigma) + sigma_eps)^(-3) sigma_sqrt_min <- min(sqrt(sigma))
 if sigma_sqrt_min > eps:
 new_step <- M U diag(sigma_sqrt_inv) U^T M^T m_t +
 (m_t - M U diag(1/sigma) U^T M^T m_t) / sigma_sqrt_min
 else:
 new_step <- M U diag(sigma_sqrt_inv) U^T M^T m_t
flat_vars <- flat_vars - learning_rate * new_step ```
GGT provides the power of full-matrix adaptive regularization at a cost not much larger than SGD. As a result it is suited for large models where the gradient covariance matrix has a poor condition number that slows down first-order methods. GGT uses the preconditioner from full-matrix AdaGrad, with gradient history attenuated exponentially as in Adam, and truncated to a window parameter. It has provable guarantees even for non-convex optimization: it is never significantly worse than SGD and in some cases better.
 Args:
learning_rate: A float hyperparameter. The learning rate. beta1: A float hyperparameter. The exponential decay rate for the 1st
moment estimates. use_locking: If True use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "GGT". window: An integer hyperparameter. The number of first moments to keep in
 computing the adaptive preconditioner.
 eps: A float hyperparameter. Used to truncate small eigenvalues of the
 gradient covariance matrix.
svd_eps: A float hyperparameter. Used to stabilize SVD. sigma_eps: A float hyperparameter. Used to regularize matrix inversion.
GradientDescent

class
tensorflow.python.training.gradient_descent.
GradientDescentOptimizer
(learning_rate, use_locking=False, name='GradientDescent')[source]
Optimizer that implements the gradient descent algorithm.
Construct a new gradient descent optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning
 rate to use.
use_locking: If True use locks for update operations. name: Optional name prefix for the operations created when applying
gradients. Defaults to "GradientDescent".
@compatibility(eager) When eager execution is enabled, learning_rate can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
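The update itself is the textbook rule, sketched here for completeness with an illustrative function name:

```python
def gradient_descent_step(var, grad, learning_rate):
    """Plain gradient descent: move against the gradient (sketch)."""
    return var - learning_rate * grad
```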
GradVarianceScaled

class
returnn.tf.updater.
GradVarianceScaledOptimizer
(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)[source]
Let m be the running average of g. Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. Same beta1 default as in Adam and in the paper: beta1=0.9.
Let v be the running average of the variance of g, i.e. of (g - m)^2.
Parameters: beta1 (float) – used for the running average of g (m)
 beta2 (float) – used for the running average of the variance of g (v)
 epsilon (float) –
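The two running averages described above can be sketched as follows. Whether the old or the updated m enters the variance term, and how v then scales the gradient step, are assumptions here and implementation details of returnn.tf.updater:

```python
import numpy as np

def running_moments(g, m, v, beta1=0.9, beta2=0.999):
    """Running average m of g, and running average v of the variance of g,
    i.e. of (g - m)^2 (illustrative sketch; uses the updated m)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g - m) ** 2
    return m, v

m, v = running_moments(np.array([1.0]), np.zeros(1), np.zeros(1))
```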
Keras
LARS

class
tensorflow.contrib.opt.python.training.lars_optimizer.
LARSOptimizer
(learning_rate, momentum=0.9, weight_decay=0.0001, eeta=0.001, epsilon=0.0, name='LARSOptimizer', skip_list=None, use_nesterov=False)[source]
Layer-wise Adaptive Rate Scaling for large batch training.
Introduced by "Large Batch Training of Convolutional Networks" by Y. You, I. Gitman, and B. Ginsburg. (https://arxiv.org/abs/1708.03888)
Implements the LARS learning rate scheme presented in the paper above. This optimizer is useful when scaling the batch size to up to 32K without significant performance degradation. It is recommended to use the optimizer in conjunction with:
 Gradual learning rate warmup
 Linear learning rate scaling
 Poly rule learning rate decay
Note, LARS scaling is currently only enabled for dense tensors. Sparse tensors use the default momentum optimizer.
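A sketch of the layer-wise trust ratio that gives LARS its name, following the paper: the learning rate of each layer is scaled by the ratio of its weight norm to its gradient norm. The function name and the zero-norm fallback are assumptions, not the contrib implementation:

```python
import numpy as np

def lars_scaled_lr(var, grad, learning_rate, weight_decay=1e-4,
                   eeta=0.001, epsilon=0.0):
    """Layer-wise learning-rate scaling in the spirit of LARS (sketch)."""
    w_norm = float(np.linalg.norm(var))
    g_norm = float(np.linalg.norm(grad))
    if w_norm > 0 and g_norm > 0:
        trust_ratio = eeta * w_norm / (g_norm + weight_decay * w_norm + epsilon)
    else:
        trust_ratio = 1.0  # fall back to the unscaled rate
    return learning_rate * trust_ratio
```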
Construct a new LARS Optimizer.
 Args:
learning_rate: A Tensor or floating point value. The base learning rate. momentum: A floating point value. Momentum hyperparameter. weight_decay: A floating point value. Weight decay hyperparameter. eeta: LARS coefficient as used in the paper. Default set to LARS
coefficient from the paper. (eeta / weight_decay) determines the highest scaling factor in LARS. epsilon: Optional epsilon parameter to be set in models that have very
 small gradients. Default set to 0.0.
name: Optional name prefix for variables and ops created by LARSOptimizer. skip_list: List of strings to enable skipping variables from LARS scaling.
If any of the strings in skip_list is a subset of var.name, variable "var" is skipped from LARS scaling. For a typical classification model with batch normalization, the skip_list is ["batch_normalization", "bias"]. use_nesterov: when set to True, Nesterov momentum will be enabled.
 Raises:
 ValueError: If a hyperparameter is set to a nonsensical value.
LazyAdam

class
tensorflow.contrib.opt.python.training.lazy_adam_optimizer.
LazyAdamOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]
Variant of the Adam optimizer that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.
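The lazy sparse behavior described above can be sketched with NumPy fancy indexing: only the rows whose indices appear in the batch are updated, and all other accumulator rows stay untouched. Names and demo values are illustrative:

```python
import numpy as np

def lazy_adam_sparse_step(var, grad_values, grad_indices, m, v, t,
                          learning_rate=0.001, beta1=0.9, beta2=0.999,
                          epsilon=1e-8):
    """Lazy sparse Adam update (illustrative sketch): only the rows named
    in grad_indices move; all other rows of var, m and v are left alone."""
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    i = grad_indices
    m[i] = beta1 * m[i] + (1 - beta1) * grad_values
    v[i] = beta2 * v[i] + (1 - beta2) * grad_values ** 2
    var[i] = var[i] - lr_t * m[i] / (np.sqrt(v[i]) + epsilon)
    return var, m, v, t

emb = np.ones((4, 1))
m = np.zeros((4, 1))
v = np.zeros((4, 1))
# Only row 1 appeared in the batch (e.g. via an embedding lookup):
emb, m, v, t = lazy_adam_sparse_step(emb, np.array([[2.0]]),
                                     np.array([1]), m, v, t=0)
```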
Construct a new Adam optimizer.
Initialization:
$$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:
$$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$ $$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$ $$variable := variable - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
 Args:
learning_rate: A Tensor or a floating point value. The learning rate. beta1: A float value or a constant float tensor. The exponential decay
rate for the 1st moment estimates. beta2: A float value or a constant float tensor. The exponential decay
 rate for the 2nd moment estimates.
 epsilon: A small constant for numerical stability. This epsilon is
 "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
use_locking: If True use locks for update operations. name: Optional name for the operations created when applying gradients.
Defaults to "Adam". @compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
LazyAdamGSÂ¶

class
tensorflow.contrib.opt.python.training.lazy_adam_gs_optimizer.
LazyAdamGSOptimizer
(global_step=0, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e08, use_locking=False, name='Adam')[source]Â¶ Variant of the Adam optimizer that handles sparse updates more efficiently.
Branched from tf.contrib.opt.LazyAdamGSOptimizer. The only difference is to pass global step for computing beta1 and beta2 accumulators, instead of having optimizer keep its own independent beta1 and beta2 accumulators as nonslot variables.
The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.
Construct a new Adam optimizer.
Branched from tf.train.AdamOptimizer. The only difference is to pass the global step for computing the beta1 and beta2 accumulators, instead of having the optimizer keep its own independent beta1 and beta2 accumulators as non-slot variables.
Initialization:
$$m_0 := 0 \quad \text{(Initialize 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of Section 2 of the paper:
$$t := t + 1$$
$$\mathrm{lr}_t := \mathrm{learning\_rate} \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g$$
$$v_t := \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g \cdot g$$
$$\mathrm{variable} := \mathrm{variable} - \mathrm{lr}_t \cdot m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
Args:
  global_step: tensorflow variable indicating the step.
  learning_rate: A Tensor or a floating point value. The learning rate.
  beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
  beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
  epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
  use_locking: If True use locks for update operations.
  name: Optional name for the operations created when applying gradients. Defaults to "Adam".
@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
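The dense update formulas above can be checked with a small NumPy sketch (an illustrative re-implementation of the formulas, not the actual TensorFlow kernel; the function name is made up):

```python
import numpy as np

def adam_step(var, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One dense Adam update step, following the formulas above."""
    t += 1
    # bias-corrected step size
    lr_t = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g        # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment estimate
    var = var - lr_t * m / (np.sqrt(v) + eps)
    return var, m, v, t
```

Note that on the first step the effective step size is close to the raw learning rate, regardless of the gradient's scale, which is the point of the bias correction.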
ModelAverage¶

class
tensorflow.contrib.opt.python.training.model_average_optimizer.
ModelAverageOptimizer
(opt, num_worker, is_chief, ma_custom_getter, interval_steps=100, use_locking=True, name='ModelAverageOptimizer')[source]¶ Wrapper optimizer that implements the Model Average algorithm.
This is a sync optimizer. During training, each worker updates its local variables and maintains its own local_step, which starts from 0 and is incremented by 1 after each update of the local variables. Whenever interval_steps divides the local step, the local variables from all workers are averaged and assigned to the global center variables; the local variables are then reset to the values of the global center variables.
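The averaging scheme can be illustrated with a small NumPy sketch (hypothetical names; the real optimizer does this with TF variables and synchronization ops across workers):

```python
import numpy as np

def model_average_step(local_vars, center_vars, local_step, interval_steps):
    """After each local update: every `interval_steps` steps, average all
    workers' local variables into the center variables, then re-seed the
    local copies from the center."""
    local_step += 1
    if local_step % interval_steps == 0:
        # average local variables from all workers into the global center
        center_vars = np.mean(local_vars, axis=0)
        # assign the center values back to every worker's local copy
        local_vars = np.array([center_vars.copy() for _ in local_vars])
    return local_vars, center_vars, local_step
```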
Construct a new model average optimizer.
Args:
  opt: The actual optimizer that will be used to update local variables.
  num_worker: The number of workers.
  is_chief: whether chief worker.
  ma_custom_getter: ModelAverageCustomGetter.
  interval_steps: An int value that controls the frequency of the averaging of local variables.
  use_locking: If True use locks for update operations.
  name: string. Optional name of the returned operation.

compute_gradients
(*args, **kwargs)[source]¶ Compute gradients of 'loss' for the variables in 'var_list'.
This simply wraps the compute_gradients() from the real optimizer.

apply_gradients
(grads_and_vars, global_step=None, name=None)[source]¶ Apply gradients to variables.
This contains most of the synchronization implementation and also wraps the apply_gradients() from the real optimizer. The chief worker updates the global variables.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the variables
 have been updated.
 name: Optional name for the returned operation. Default to the name
 passed to the Optimizer constructor.
 Returns:
A conditional Operation that updates both local and global variables, or just local variables.
 Raises:
ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be checked.
Momentum¶

class
tensorflow.python.training.momentum.
MomentumOptimizer
(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)[source]¶ Optimizer that implements the Momentum algorithm.
Computes (if use_nesterov = False):
```
accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation
```
Note that in the dense version of this algorithm, accumulation is updated and applied regardless of a gradient's value, whereas the sparse version (when the gradient is an IndexedSlices, typically because of tf.gather or an embedding) only updates variable slices and corresponding accumulation terms when that part of the variable was used in the forward pass.
Construct a new Momentum optimizer.
Args:
  learning_rate: A Tensor or a floating point value. The learning rate.
  momentum: A Tensor or a floating point value. The momentum.
  use_locking: If True use locks for update operations.
  name: Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".
  use_nesterov: If True use Nesterov Momentum. See [Sutskever et al., 2013](http://jmlr.org/proceedings/papers/v28/sutskever13.pdf). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper. This implementation is an approximation of the original formula, valid for high values of momentum. It will compute the "adjusted gradient" in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient.
@compatibility(eager) When eager execution is enabled, learning_rate and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
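The dense momentum update (and its Nesterov variant) can be sketched in NumPy (an illustrative re-implementation, not the actual kernel; the function name is made up):

```python
import numpy as np

def momentum_step(var, g, accum, lr=0.01, momentum=0.9, use_nesterov=False):
    """One dense momentum update, following the formulas above."""
    accum = momentum * accum + g
    if use_nesterov:
        # Nesterov variant: step along the gradient plus the momentum-scaled
        # accumulator, i.e. a look-ahead along the accumulated direction
        var = var - lr * (g + momentum * accum)
    else:
        var = var - lr * accum
    return var, accum
```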
MomentumW¶

class
tensorflow.contrib.opt.python.training.weight_decay_optimizers.
MomentumWOptimizer
(weight_decay, learning_rate, momentum, use_locking=False, name='MomentumW', use_nesterov=False)[source]¶ Optimizer that implements the Momentum algorithm with weight_decay.
This is an implementation of the SGDW optimizer described in "Fixing Weight Decay Regularization in Adam" by Loshchilov & Hutter (https://arxiv.org/abs/1711.05101) ([pdf](https://arxiv.org/pdf/1711.05101.pdf)). It computes the update step of train.MomentumOptimizer and additionally decays the variable. Note that this is different from adding L2 regularization on the variables to the loss. Decoupling the weight decay from other hyperparameters (in particular the learning rate) simplifies hyperparameter search.
For further information see the documentation of the Momentum Optimizer.
Note that this optimizer can also be instantiated as:

```python
extend_with_weight_decay(tf.compat.v1.train.MomentumOptimizer,
                         weight_decay=weight_decay)
```

Construct a new MomentumW optimizer.
For further information see the documentation of the Momentum Optimizer.
Args:
  weight_decay: A Tensor or a floating point value. The weight decay.
  learning_rate: A Tensor or a floating point value. The learning rate.
  momentum: A Tensor or a floating point value. The momentum.
  use_locking: If True use locks for update operations.
  name: Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".
  use_nesterov: If True use Nesterov Momentum. See [Sutskever et al., 2013](http://jmlr.org/proceedings/papers/v28/sutskever13.pdf). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.
@compatibility(eager) When eager execution is enabled, learning_rate, weight_decay and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
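The decoupled weight decay can be sketched as follows (an illustrative sketch, not TF's actual kernel): the decay is applied directly to the variable, outside the gradient/accumulator path, unlike L2 regularization, which would add weight_decay * var to the gradient and thereby feed it into the momentum accumulator:

```python
import numpy as np

def momentum_w_step(var, g, accum, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """Momentum step plus decoupled weight decay (SGDW)."""
    accum = momentum * accum + g
    # decay the variable directly; this does NOT enter the momentum accumulator
    var = var - lr * accum - weight_decay * var
    return var, accum
```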
MovingAverage¶

class
tensorflow.contrib.opt.python.training.moving_average_optimizer.
MovingAverageOptimizer
(opt, average_decay=0.9999, num_updates=None, sequential_update=True)[source]¶ Optimizer that computes a moving average of the variables.
Empirically it has been found that using the moving average of the trained parameters of a deep network is better than using its trained parameters directly. This optimizer allows you to compute this moving average and swap the variables at save time so that any code outside of the training loop will use by default the averaged values instead of the original ones.
Example of usage:
```python
# Encapsulate your favorite optimizer (here the momentum one)
# inside the MovingAverageOptimizer.
opt = tf.compat.v1.train.MomentumOptimizer(learning_rate, FLAGS.momentum)
opt = tf.contrib.opt.MovingAverageOptimizer(opt)
# Then create your model and all its variables.
model = build_model()
# Add the training op that optimizes using opt.
# This needs to be called before swapping_saver().
opt.minimize(cost, var_list)
# Then create your saver like this:
saver = opt.swapping_saver()
# Pass it to your training loop.
slim.learning.train(
    model,
    ...,
    saver=saver)
```
Note that for evaluation, the normal saver should be used instead of swapping_saver().
Construct a new MovingAverageOptimizer.
Args:
  opt: A tf.Optimizer that will be used to compute and apply gradients.
  average_decay: Float. Decay to use to maintain the moving averages of trained variables. See tf.train.ExponentialMovingAverage for details.
  num_updates: Optional count of number of updates applied to variables. See tf.train.ExponentialMovingAverage for details.
  sequential_update: Bool. If False, will compute the moving average at the same time as the model is updated, potentially doing benign data races. If True, will update the moving average after gradient updates.

compute_gradients
(*args, **kwargs)[source]¶ Compute gradients of loss for the variables in var_list.
This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
 Args:
 loss: A Tensor containing the value to minimize or a callable taking
 no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
 var_list: Optional list or tuple of tf.Variable to update to minimize
 loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
 gate_gradients: How to gate the computation of gradients. Can be
 GATE_NONE, GATE_OP, or GATE_GRAPH.
 aggregation_method: Specifies the method used to combine gradient terms.
 Valid values are defined in the class AggregationMethod.
 colocate_gradients_with_ops: If True, try colocating gradients with
 the corresponding op.
grad_loss: Optional. A Tensor holding the gradient computed for loss.
 Returns:
 A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.
 Raises:
TypeError: If var_list contains anything else than Variable objects.
ValueError: If some arguments are invalid.
RuntimeError: If called with eager execution enabled and loss is not callable.
@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

apply_gradients
(grads_and_vars, global_step=None, name=None)[source]¶ Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the
 variables have been updated.
 name: Optional name for the returned operation. Default to the
 name passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
 Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.

swapping_saver
(var_list=None, name='swapping_saver', **kwargs)[source]¶ Create a saver swapping moving averages and variables.
You should use this saver during training. It will save the moving averages of the trained parameters under the original parameter names. For evaluations or inference you should use a regular saver and it will automatically use the moving averages for the trained variable.
You must call this function after all variables have been created and after you have called Optimizer.minimize().
 Args:
var_list: List of variables to save, as per Saver(). If set to None, will save all the variables that have been created before this call.
name: The name of the saver.
**kwargs: Keyword arguments of Saver().
 Returns:
 A tf.compat.v1.train.Saver object.
 Raises:
RuntimeError: If apply_gradients or minimize has not been called before.
ValueError: If var_list is provided and contains some variables but not their moving average counterpart.
Nadam¶

class
tensorflow.contrib.opt.python.training.nadam_optimizer.
NadamOptimizer
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')[source]¶ Optimizer that implements the Nadam algorithm.
See [Dozat, T., 2015](http://cs229.stanford.edu/proj2015/054_report.pdf).
Construct a new Adam optimizer.
Initialization:
$$m_0 := 0 \quad \text{(Initialize 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$
The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:
$$t := t + 1$$
$$\mathrm{lr}_t := \mathrm{learning\_rate} \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$
$$m_t := \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g$$
$$v_t := \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g \cdot g$$
$$\mathrm{variable} := \mathrm{variable} - \mathrm{lr}_t \cdot m_t / (\sqrt{v_t} + \epsilon)$$
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).
Args:
  learning_rate: A Tensor or a floating point value. The learning rate.
  beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
  beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
  epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
  use_locking: If True use locks for update operations.
  name: Optional name for the operations created when applying gradients. Defaults to "Adam".
@compatibility(eager) When eager execution is enabled, learning_rate, beta1, beta2, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
NeuralO¶

class
returnn.tf.updater.
NeuralOptimizer1
(beta1=0.9, decrease_factor=0.1, **kwargs)[source]¶ Via Neural Optimizer Search with Reinforcement Learning (http://proceedings.mlr.press/v70/bello17a/bello17a.pdf).
In place of the paper's optimizer g * exp(sign(g) * sign(m)), we use:

```
g * where(sign(g) == sign(m), 1.0, decrease_factor)
```

where m is the running average of g.
Calculation of m: m_t <- beta1 * m_{t-1} + (1 - beta1) * g. Same beta1 default as in Adam and in the paper: beta1=0.9.
Parameters:
  beta1 (float) – used for the running average of m
  decrease_factor (float) – in the original paper, it is e^-2 ≈ 0.135
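The update can be sketched in NumPy (an illustrative sketch of the formulas above, not the actual RETURNN kernel; the function name is made up):

```python
import numpy as np

def neural_optimizer1_step(var, g, m, lr=0.01, beta1=0.9, decrease_factor=0.1):
    """Scale the gradient down where its sign disagrees with the running
    average m, following the where(...) formula above."""
    m = beta1 * m + (1 - beta1) * g  # running average of g
    scale = np.where(np.sign(g) == np.sign(m), 1.0, decrease_factor)
    var = var - lr * g * scale
    return var, m
```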
Norm¶

class
returnn.tf.updater.
NormalizedSGD
(learning_rate, use_locking=False, name=None)[source]¶ All grads are L2-normalized (via tf.nn.l2_normalize()), otherwise it's standard SGD. Via: https://github.com/kmkolasinski/deeplearningnotes/tree/master/maxnormedoptimizer
Construct a new optimizer.
Args:
  learning_rate: A Tensor or a floating point value. The learning rate to use.
  use_locking: If True use locks for update operations.
  name: Optional name prefix for the operations created when applying gradients. Defaults to self.__class__.__name__.
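The step can be sketched in NumPy (an illustrative sketch, assuming the same max(sum-of-squares, epsilon) guard as tf.nn.l2_normalize; the function name is made up):

```python
import numpy as np

def normalized_sgd_step(var, g, lr=0.01, eps=1e-12):
    """SGD with the gradient L2-normalized per variable, so only the
    gradient's direction matters, not its magnitude."""
    g_norm = g / np.sqrt(np.maximum(np.sum(g * g), eps))
    return var - lr * g_norm
```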
PowerSign¶

class
tensorflow.contrib.opt.python.training.powersign.
PowerSignOptimizer
(learning_rate=0.1, base=2.718281828459045, beta=0.9, sign_decay_fn=None, use_locking=False, name='PowerSignOptimizer')[source]¶ Optimizer that implements the PowerSign update.
See [Bello et al., ICML 2017], [Neural Optimizer Search with RL](https://arxiv.org/abs/1709.07417).
Constructs a new PowerSignOptimizer object.
Initialization:
```
m_0 <- 0 (Initialize initial 1st moment vector)
t <- 0 (Initialize timestep)
```
Update:
```
t <- t + 1
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
sign_decay <- sign_decay_fn(t)
update <- base ** (sign_decay * sign(g) * sign(m)) * g
variable <- variable - lr_t * update
```
Example usage for PowerSign-cd (PowerSign with cosine sign decay):
```
decay_steps = 1000
linear_decay_fn = sign_decays.get_cosine_decay_fn(decay_steps)
opt = PowerSignOptimizer(learning_rate=0.1, sign_decay_fn=linear_decay_fn)
```
Args:
  learning_rate: learning_rate used when taking a step.
  base: base used in optimizer.
  beta: decay used for computing the moving average m.
  sign_decay_fn: decay function applied to the sign(g) sign(m) quantity. Takes global_step as an argument. See sign_decay.py for some examples.
  use_locking: If True, use locks for update operations.
  name: Optional name for the operations created when applying gradients. Defaults to "PowerSignOptimizer".
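The update rule above can be sketched in NumPy (an illustrative sketch, not the actual TensorFlow kernel; the function name is made up):

```python
import numpy as np

def powersign_step(var, g, m, t, lr=0.1, base=np.e, beta1=0.9,
                   sign_decay_fn=None):
    """One PowerSign update following the update rules above: scale the
    gradient up when sign(g) and sign(m) agree, down when they disagree."""
    t += 1
    m = beta1 * m + (1 - beta1) * g
    sign_decay = sign_decay_fn(t) if sign_decay_fn is not None else 1.0
    update = base ** (sign_decay * np.sign(g) * np.sign(m)) * g
    var = var - lr * update
    return var, m, t
```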

apply_gradients
(grads_and_vars, global_step=None, name=None)[source]Â¶ Apply gradients to variables.
This is the second part of minimize(). It returns an Operation that applies gradients.
 Args:
 grads_and_vars: List of (gradient, variable) pairs as returned by
 compute_gradients().
 global_step: Optional Variable to increment by one after the
 variables have been updated.
 name: Optional name for the returned operation. Default to the
 name passed to the Optimizer constructor.
 Returns:
 An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.
 Raises:
TypeError: If grads_and_vars is malformed.
ValueError: If none of the variables have gradients.
RuntimeError: If you should use _distributed_apply() instead.
ProximalAdagrad¶

class
tensorflow.python.training.proximal_adagrad.
ProximalAdagradOptimizer
(learning_rate, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalAdagrad')[source]¶ Optimizer that implements the Proximal Adagrad algorithm.
See this [paper](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf).
Construct a new ProximalAdagrad optimizer.
Args:
  learning_rate: A Tensor or a floating point value. The learning rate.
  initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive.
  l1_regularization_strength: A float value, must be greater than or equal to zero.
  l2_regularization_strength: A float value, must be greater than or equal to zero.
  use_locking: If True use locks for update operations.
  name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad".
Raises:
  ValueError: If the initial_accumulator_value is invalid.
ProximalGradientDescent¶

class
tensorflow.python.training.proximal_gradient_descent.
ProximalGradientDescentOptimizer
(learning_rate, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='ProximalGradientDescent')[source]¶ Optimizer that implements the proximal gradient descent algorithm.
See this [paper](http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf).
Construct a new proximal gradient descent optimizer.
 Args:
 learning_rate: A Tensor or a floating point value. The learning
 rate to use.
 l1_regularization_strength: A float value, must be greater than or
 equal to zero.
 l2_regularization_strength: A float value, must be greater than or
 equal to zero.
use_locking: If True use locks for update operations.
name: Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".
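The forward-backward splitting scheme from the referenced paper can be sketched as a gradient step followed by a closed-form proximal operator for the l1 and l2 terms (a minimal illustrative sketch using the standard soft-thresholding/shrinkage operators, not the actual fused op; the function name is made up):

```python
import numpy as np

def proximal_gd_step(var, g, lr=0.1, l1=0.0, l2=0.0):
    """Gradient (forward) step, then the proximal (backward) step for
    the l1 and l2 regularizers."""
    prox = var - lr * g
    # soft-thresholding is the proximal operator of the l1 term
    prox = np.sign(prox) * np.maximum(np.abs(prox) - lr * l1, 0.0)
    # shrinkage is the proximal operator of the l2 term
    return prox / (1.0 + lr * l2)
```

With l1 = l2 = 0 this reduces to plain gradient descent.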
RegAdagrad¶

class
tensorflow.contrib.opt.python.training.reg_adagrad_optimizer.
RegAdagradOptimizer
(learning_rate, initial_accumulator_value=0.1, use_locking=False, name='RegAdagrad')[source]¶ RegAdagrad: Adagrad with updates that optionally skip updating the slots.
This is meant to address the problem of additional regularization terms in the loss function affecting learning rate decay and causing hyperparam entanglement. Example usage:
```python
loss = tf.nn.cross_entropy(x, labels)
reg_loss = reg_strength * tf.reduce_sum(x * x)
opt = tf.contrib.opt.RegAdagradOptimizer(learning_rate)
loss_update = opt.minimize(loss)
with opt.avoid_updating_slots():
    reg_update = opt.minimize(reg_loss)
total_update = tf.group([loss_update, reg_update])
# ...
sess.run(total_update, ...)
```
RMSProp¶

class
tensorflow.python.training.rmsprop.
RMSPropOptimizer
(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, name='RMSProp')[source]¶ Optimizer that implements the RMSProp algorithm.
See the [paper](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf).
Construct a new RMSProp optimizer.
Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an "eventual" correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm.
Args:
  learning_rate: A Tensor or a floating point value. The learning rate.
  decay: Discounting factor for the history/coming gradient.
  momentum: A scalar tensor.
  epsilon: Small value to avoid zero denominator.
  use_locking: If True use locks for update operation.
  centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
  name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".
@compatibility(eager) When eager execution is enabled, learning_rate, decay, momentum, and epsilon can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility
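The dense update (including the centered variant) can be sketched in NumPy (an illustrative sketch of the algorithm, not the actual TensorFlow kernel; the function name is made up):

```python
import numpy as np

def rmsprop_step(var, g, mean_square, mom, mean_grad=None,
                 lr=0.01, decay=0.9, momentum=0.0, eps=1e-10, centered=False):
    """One dense RMSProp update, optionally centered."""
    mean_square = decay * mean_square + (1 - decay) * g * g
    if centered:
        # centered variant: normalize by the estimated variance of g
        mean_grad = decay * mean_grad + (1 - decay) * g
        denom = np.sqrt(mean_square - mean_grad**2 + eps)
    else:
        # uncentered: normalize by the second moment of g
        denom = np.sqrt(mean_square + eps)
    mom = momentum * mom + lr * g / denom
    var = var - mom
    return var, mean_square, mom, mean_grad
```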
Shampoo¶

class
tensorflow.contrib.opt.python.training.shampoo.
ShampooOptimizer
(global_step=0, max_matrix_size=768, gbar_decay=0.0, gbar_weight=1.0, mat_gbar_decay=1.0, mat_gbar_weight=1.0, learning_rate=1.0, svd_interval=1, precond_update_interval=1, epsilon=0.0001, alpha=0.5, use_iterative_root=False, use_locking=False, name='Shampoo')[source]¶ The Shampoo Optimizer
Variant of Adagrad using one preconditioner matrix per variable dimension. For details, see https://arxiv.org/abs/1802.09568
gbar is the time-weighted accumulated gradient: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
mat_gbar is the time-weighted accumulated gradient square: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
where if g[t] = g_abcd then gg_a[t] = g_abcd g_a'bcd (Einstein notation)
Update rule: w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]
Again, mat_gbar_j[t]^(-alpha/n) gbar[t] is a tensor contraction along the j'th dimension of gbar[t] with the first dimension of mat_gbar_j[t]^(-alpha/n), where alpha is a hyperparameter, and n = rank of the variable. Prod_j represents doing this contraction for all j in 0..n-1. Typically learning_rate is constant, but it could be time dependent by passing a lambda function that depends on step.
Default values of the various hyperparameters.
gbar_decay, gbar_weight etc. can be a float or a time varying parameter. For time-varying parameters use e.g. "lambda T: T / (T + 1.0)" where the expression in the lambda is a tensorflow expression.
Args:
  global_step: tensorflow variable indicating the step.
  max_matrix_size: We do not perform SVD for matrices larger than this.
  gbar_decay:
  gbar_weight: Used to update gbar: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
  mat_gbar_decay:
  mat_gbar_weight: Used to update mat_gbar: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
  learning_rate: Similar to SGD.
  svd_interval: We should do SVD after this many steps. Default = 1, i.e. every step. Usually 20 leads to no loss of accuracy, and 50 or 100 is also OK. May also want more often early, and less often later - set in caller as for example: "svd_interval = lambda(T): tf.cond(T < 2000, lambda: 20.0, lambda: 1000.0)"
  precond_update_interval: We should update the preconditioners after this many steps. Default = 1. Usually less than svd_interval.
  epsilon: epsilon * I_n is added to each mat_gbar_j for stability for the non-diagonal version of shampoo.
  alpha: total power of the preconditioners.
  use_iterative_root: should the optimizer use SVD (faster) or the iterative root method (for TPU) for finding the roots of PSD matrices.
  use_locking:
  name: name of optimizer.
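For a 2-D variable the update rule can be illustrated with a NumPy sketch (a deliberately simplified flavor assuming gbar_decay=0, gbar_weight=1, mat_gbar_decay=1, mat_gbar_weight=1, exact eigendecomposition roots every step, and no max_matrix_size cutoff; names are illustrative):

```python
import numpy as np

def matrix_power_psd(mat, p):
    """Raise a symmetric PSD matrix to a (possibly fractional) power via eigh."""
    w, v = np.linalg.eigh(mat)
    # clamp eigenvalues away from zero so negative powers stay finite
    return (v * np.maximum(w, 1e-12) ** p) @ v.T

def shampoo_step(var, g, left, right, lr=1.0, alpha=0.5, epsilon=1e-4):
    """One Shampoo update for a 2-D variable: accumulate one preconditioner
    per axis, then contract g with each preconditioner^(-alpha/n), n = 2."""
    left = left + g @ g.T    # gg_0: g contracted with itself over axis 1
    right = right + g.T @ g  # gg_1: g contracted with itself over axis 0
    n = var.ndim
    pre_l = matrix_power_psd(left + epsilon * np.eye(left.shape[0]), -alpha / n)
    pre_r = matrix_power_psd(right + epsilon * np.eye(right.shape[0]), -alpha / n)
    return var - lr * pre_l @ g @ pre_r, left, right
```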
SyncReplicas¶

class
tensorflow.python.training.sync_replicas_optimizer.
SyncReplicasOptimizer
(opt, replicas_to_aggregate, total_num_replicas=None, variable_averages=None, variables_to_average=None, use_locking=False, name='sync_replicas')[source]¶ Class to synchronize, aggregate gradients and pass them to the optimizer.
This class is deprecated. For synchronous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
In a typical asynchronous training environment, it's common to have some stale gradients. For example, with N-replica asynchronous training, gradients will be applied to the variables N times i