Native operations

Motivation:

  • Speed up some important common calculations, and potentially reduce memory requirements. Examples:

    • LSTM

    • CTC loss

  • Pure TensorFlow implementations can be suboptimal

    • TF ops almost always create copies, even ops like SplitOp

      • Not a memory problem, as the input tensor will be freed if it is not used further

      • But it is a performance problem

    • Gradient might be suboptimal

      • Might require too much memory (see automatic gradient checkpointing for a solution)

      • No automatic optimization

      • (Could be solved by custom TF gradient)

    • Memory can be scattered over many separate allocations (tf.TensorArray, TF Stack)

      • Especially problematic in loops: a separate tensor for every iteration

      • Much better to allocate one consecutive / contiguous block (see the sketch after this list)

    • Overhead from calling many individual TF ops, etc. (minor compared to the other points; XLA can also partially solve this)
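
To illustrate the contiguous-allocation point above, here is a minimal C++ sketch (not RETURNN code; the function name and shapes are made up): one block is allocated up front and every loop iteration writes into a view of it, instead of allocating a new tensor per iteration.

    #include <vector>

    // Hypothetical sketch: one contiguous buffer for all time frames,
    // instead of a separate allocation per loop iteration.
    void run_loop_contiguous(int T, int B, int D) {  // T time frames, batch size B, feature dim D
        std::vector<float> output((size_t)T * B * D);  // one consecutive block
        for (int t = 0; t < T; ++t) {
            // View into the block for time frame t; no new allocation per iteration.
            float* frame_t = output.data() + (size_t)t * B * D;
            // ... compute the output of time frame t directly into frame_t ...
            (void)frame_t;  // placeholder for the actual per-frame computation
        }
    }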

Solution: Write native (C++/CUDA) code

Why is native code faster?

  • Operates in place on tensors

    • Solves all the problems mentioned above: no unnecessary copies

    • Can use consecutive / contiguous memory (see the sketch below)

  • Enforces a custom gradient implementation
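
As a minimal illustration (a hypothetical kernel, not an actual RETURNN op): an in-place CUDA kernel can overwrite its contiguous input buffer directly, so no output tensor is allocated and nothing is copied.

    // Hypothetical example: apply a sigmoid in place on a contiguous buffer of n floats.
    __global__ void sigmoid_inplace_kernel(float* data, int n) {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        while (idx < n) {
            data[idx] = 1.0f / (1.0f + expf(-data[idx]));  // overwrite the input, no copy
            idx += gridDim.x * blockDim.x;  // grid-stride loop
        }
    }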

Problems with native code:

  • Can be difficult and memory-unsafe; needs more debugging

  • Needs multiple implementations: CPU (C++), GPU (CUDA)

Our Approach in RETURNN:

The NativeOp framework. See returnn.native_op, returnn.tf.native_op, returnn.theano.native_op.

  • Some wrapper / helper code to simplify writing custom native op

  • Abstractions to allow single code for CPU & GPU

    • Write kernel CUDA style, using threadIdx, blockIdx, etc

      • Kernel code must be flexible with respect to the number of threads (grid-stride loop)

      • Example: LSTM kernel, loop over cells × batch, executed per time frame:

        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        while (idx < n_cells * n_batch) {
            int batch_idx = idx / n_cells;
            int cell_idx = idx % n_cells;
            ...
            idx += gridDim.x * blockDim.x;  // grid-stride: works for any number of threads
        }
        
    • On CPU

      • Custom gridDim, blockDim

      • Other CUDA-like wrappers (see the sketch below)
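
Roughly, the idea behind such wrappers can be sketched like this (simplified; Dim3 and launch_cpu are illustrative names, not the actual RETURNN wrapper code): the CUDA built-in indices become plain values, and a host-side loop over a virtual grid runs the same CUDA-style kernel body on the CPU.

    // Hypothetical sketch of CUDA-like wrappers for the CPU path.
    struct Dim3 { int x, y, z; };

    template<typename Kernel>
    void launch_cpu(Dim3 gridDim, Dim3 blockDim, Kernel kernel) {
        Dim3 blockIdx = {0, 0, 0}, threadIdx = {0, 0, 0};
        for (blockIdx.x = 0; blockIdx.x < gridDim.x; ++blockIdx.x)
            for (threadIdx.x = 0; threadIdx.x < blockDim.x; ++threadIdx.x)
                // The kernel body reads the indices exactly as it would in CUDA.
                kernel(gridDim, blockDim, blockIdx, threadIdx);
    }

A grid-stride kernel body like the LSTM example above then runs unchanged on both CPU and GPU.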

History:

  • Already available for the Theano backend

  • Ported to TensorFlow

    • Direct support for all previously implemented ops (LSTM, Baum-Welch aligner, …)

  • Easy to port to other frameworks

Examples:

  • NativeLstm (LstmGenericBase)

  • NativeLstm2

  • TwoDLSTM

  • FastBaumWelch

  • FastViterbi

  • OptimalCompletionEditDistance

  • EditDistance

  • Chunking, UnChunking

See also TensorFlow LSTM Benchmark.