Native operations

Motivation:

  • Speed up some important common calculations, and potentially reduce memory requirements. Examples:
    • LSTM
    • CTC loss
  • Pure TensorFlow implementations can be suboptimal
    • TF ops almost always create copies of their inputs, even ops like SplitOp
      • Not a memory problem, as the input tensor gets freed once it is no longer used
      • But a performance problem (extra memory traffic)
    • The gradient might be suboptimal
      • Can require too much memory (see automatic gradient checkpointing for a solution)
      • No automatic optimization
      • (Could be solved by a custom TF gradient)
    • Memory can end up scattered over many separate tensors (tf.TensorArray, TF Stack)
      • Especially problematic in a loop: a separate tensor for every iteration
      • Much better to allocate one consecutive / contiguous block (see the sketch after this list)
    • Overhead of calling many individual TF ops (minor compared to the other points; XLA can partially solve this as well)
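
To make the contiguous-memory point concrete, here is a minimal C++ sketch (not actual RETURNN code; the names n_time, n_batch, n_cells are only illustrative) of writing all time frames of a recurrent op into one preallocated buffer instead of creating a separate tensor per iteration:

    #include <vector>

    // One contiguous buffer for all time frames, with shape
    // (n_time, n_batch, n_cells), instead of one allocation per frame.
    struct RecurrentOutput {
      int n_time, n_batch, n_cells;
      std::vector<float> data;  // n_time * n_batch * n_cells floats, contiguous

      RecurrentOutput(int n_time, int n_batch, int n_cells)
        : n_time(n_time), n_batch(n_batch), n_cells(n_cells),
          data((size_t) n_time * n_batch * n_cells) {}

      // Pointer to the (n_batch, n_cells) slice of time frame t.
      float* frame(int t) { return data.data() + (size_t) t * n_batch * n_cells; }
    };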

Solution: Write native (C++/CUDA) code

Why is native code faster?

  • Operates in-place on tensors (see the sketch after this list)
    • Solves all the problems mentioned above: no unnecessary copies
    • Can use one consecutive tensor / contiguous block of memory
  • Enforces a custom gradient implementation
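
As an illustration of the in-place idea, here is a minimal CUDA-style sketch (not the actual NativeLstm kernel) that overwrites a contiguous buffer of gate pre-activations with their sigmoid activations, so no separate output tensor has to be allocated:

    // Minimal sketch, not the actual NativeLstm kernel: apply the sigmoid
    // in-place to a contiguous buffer x of n gate pre-activations.
    __global__ void sigmoid_inplace(float* x, int n) {
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      while (idx < n) {
        x[idx] = 1.0f / (1.0f + expf(-x[idx]));  // overwrite the input value
        idx += gridDim.x * blockDim.x;
      }
    }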

Problems with native code:

  • Can be difficult to write, is memory-unsafe, and needs more debugging
  • Needs multiple implementations: CPU (C++) and GPU (CUDA)

Our Approach in RETURNN:

The NativeOp framework. See returnn.native_op, returnn.tf.native_op, returnn.theano.native_op.

  • Some wrapper / helper code to simplify writing a custom native op

  • Abstractions to allow single code for CPU & GPU

    • Write the kernel CUDA-style, using threadIdx, blockIdx, etc.

      • Kernel code must be flexible with respect to the number of threads (grid-stride loop)

      • Example: an LSTM kernel loop over the cell and batch dimensions, executed once per time frame:

        // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th
        // (batch, cell) index, so the kernel works for any launch configuration.
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        while (idx < n_cells * n_batch) {
            int batch_idx = idx / n_cells;
            int cell_idx = idx % n_cells;
            ...
            idx += gridDim.x * blockDim.x;
        }
        
    • On CPU

      • Custom gridDim, blockDim
      • Other CUDA-like wrappers, so the same kernel code also compiles for the CPU (see the sketch below)
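
A minimal illustrative sketch of this idea (not the exact RETURNN wrapper code): define the CUDA built-ins as ordinary host-side variables, so a grid-stride kernel like the one above compiles unchanged and runs as a plain sequential loop on the CPU:

    // Illustrative only, not the exact RETURNN helpers.  With
    // blockDim.x == gridDim.x == 1 and threadIdx.x == blockIdx.x == 0,
    // the grid-stride loop in the kernel starts at idx = 0 and steps by 1,
    // i.e. it degenerates to a plain sequential loop over all indices.
    #define __global__            /* no-op when compiling for the CPU */

    struct Dim3 { int x, y, z; };

    static const Dim3 threadIdx = {0, 0, 0};
    static const Dim3 blockIdx  = {0, 0, 0};
    static const Dim3 blockDim  = {1, 1, 1};
    static const Dim3 gridDim   = {1, 1, 1};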

History:

  • Already available for the Theano backend
  • Ported to TensorFlow
    • Direct support for all previously implemented ops (LSTM, Baum-Welch aligner, …)
  • Easy to port to other frameworks

Examples:

  • NativeLstm (LstmGenericBase)
  • NativeLstm2
  • TwoDLSTM
  • FastBaumWelch
  • FastViterbi
  • OptimalCompletionEditDistance
  • EditDistance
  • Chunking, UnChunking

See also TensorFlow LSTM Benchmark.