Native operations
Motivation:
Speed up some important common calculations, and potentially reduce memory requirements. Examples:
LSTM
CTC loss
Pure TensorFlow implementations can be suboptimal
TF ops almost always create copies, even SplitOp etc
Not a memory problem, as input tensor will get freed if not used further
Performance problem
Gradient might be suboptimal
Require too much memory (see automatic gradient checkpointing for a solution)
No automatic optimization
(Could be solved by custom TF gradient)
Memory can be spread over too many separate allocations (tf.TensorArray, TFStack)
Esp. problematic in loops: a separate tensor for every iteration
Much better to allocate it as consecutive / contiguous block
Overhead (calling individual TF ops, etc.); minor compared to the other points, and XLA can partially solve this as well
Solution: Write native (C++/CUDA) code
Why is native code faster?
Operate in-place on tensors
Solves all problems mentioned, no unnecessary copies
Can use consecutive tensor / memory
Enforces custom gradient implementation
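The in-place and contiguous-memory points can be illustrated with a small sketch. It is not taken from RETURNN: `SeqBuffer` and `run_recurrence` are hypothetical names, and the recurrence is a toy stand-in for a real LSTM step. All time frames live in one allocation, and each step writes directly into its slice of that buffer instead of creating a new tensor per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of RETURNN): one contiguous buffer holding
// all time frames, laid out as [n_time, frame_size].
struct SeqBuffer {
    std::size_t n_time;
    std::size_t frame_size;
    std::vector<float> data;  // single allocation for the whole sequence

    SeqBuffer(std::size_t n_time_, std::size_t frame_size_)
        : n_time(n_time_), frame_size(frame_size_),
          data(n_time_ * frame_size_, 0.0f) {}

    // Pointer to the start of time frame t; no copy is made.
    float* frame(std::size_t t) {
        assert(t < n_time);
        return data.data() + t * frame_size;
    }
};

// Toy recurrence (assumption: stands in for a real per-frame kernel).
// Frame t is updated in-place from frame t-1, inside the preallocated buffer.
inline void run_recurrence(SeqBuffer& buf, float decay) {
    for (std::size_t t = 1; t < buf.n_time; ++t) {
        const float* prev = buf.frame(t - 1);
        float* cur = buf.frame(t);
        for (std::size_t i = 0; i < buf.frame_size; ++i)
            cur[i] += decay * prev[i];
    }
}
```

With a per-iteration tensor (as with tf.TensorArray), each step would instead allocate its own output and the frames would not be adjacent in memory.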
Problems with native code:
Can be difficult, memory unsafe, needs more debugging
Need multiple implementations: CPU (C++), GPU (CUDA)
Our Approach in RETURNN:
The NativeOp framework.
See returnn.native_op, returnn.tf.native_op, returnn.theano.native_op.
Some wrapper / helper code to simplify writing custom native ops
Abstractions to allow single code for CPU & GPU
Write kernels CUDA-style, using threadIdx, blockIdx, etc.
Kernel code must work for any number of threads
Example, LSTM kernel, loop over dimensions, executed per time-frame:
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    while (idx < n_cells * n_batch) {
        int batch_idx = idx / n_cells;
        int cell_idx = idx % n_cells;
        ...
        idx += gridDim.x * blockDim.x;
    }
On CPU:
Custom gridDim, blockDim
Other CUDA-like wrappers
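How a CUDA-style kernel can also run on the CPU can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not RETURNN's actual wrapper code: `Dim3`, `add_bias_kernel` and `launch_on_cpu` are hypothetical names, and the index variables are passed as ordinary parameters here, whereas on the GPU they are CUDA built-ins.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 (only the x component is used here).
struct Dim3 {
    int x;
    explicit Dim3(int x_ = 1) : x(x_) {}
};

// Kernel body written CUDA-style with a grid-stride loop, so it is correct
// for any number of threads. It adds a per-cell bias in-place.
inline void add_bias_kernel(Dim3 threadIdx, Dim3 blockIdx,
                            Dim3 blockDim, Dim3 gridDim,
                            int n_cells, int n_batch,
                            const float* bias, float* out) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    while (idx < n_cells * n_batch) {
        int batch_idx = idx / n_cells;
        int cell_idx = idx % n_cells;
        out[batch_idx * n_cells + cell_idx] += bias[cell_idx];
        idx += gridDim.x * blockDim.x;
    }
}

// CPU "launch" (assumption of this sketch): emulate the grid by calling the
// kernel body once per (block, thread) pair, sequentially.
inline void launch_on_cpu(Dim3 gridDim, Dim3 blockDim,
                          int n_cells, int n_batch,
                          const float* bias, float* out) {
    assert(gridDim.x > 0 && blockDim.x > 0);
    for (int b = 0; b < gridDim.x; ++b)
        for (int t = 0; t < blockDim.x; ++t)
            add_bias_kernel(Dim3(t), Dim3(b), blockDim, gridDim,
                            n_cells, n_batch, bias, out);
}
```

Because of the grid-stride loop, every element is processed exactly once whether the emulated grid has a single thread, fewer threads than elements, or more threads than elements.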
History:
Already available for the Theano backend
Ported to TensorFlow
Direct support for all previously implemented ops (LSTM, Baum-Welch aligner, …)
Easy to port to other frameworks
Examples:
NativeLstm (LstmGenericBase)
NativeLstm2
TwoDLSTM
FastBaumWelch
FastViterbi
OptimalCompletionEditDistance
EditDistance
Chunking, UnChunking
See also TensorFlow LSTM Benchmark.