A couple of TF operations have a non-deterministic GPU implementation (for efficiency reasons), i.e. their results can differ between runs when executed on the GPU. See also here.
Convolutional ops (via cuDNN) can be non-deterministic (see here).
matmul is deterministic. From the CUDA doc:
By design, all CUBLAS API routines from a given toolkit version, generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit version because the implementation might differ due to some implementation changes.
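As background on why bit-wise reproducibility is a concern at all: floating-point addition is not associative, so an op that sums the same values in a different order can return a slightly different result. Whether a given GPU op actually varies its summation order is an assumption here and not something the text above states. A minimal plain-Python sketch of the effect:

```python
# Floating-point addition is not associative:
# the same three values summed in two different orders.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

# The two results are not bit-wise identical.
print(left_first == right_first)  # False

# The difference is tiny but non-zero.
print(abs(left_first - right_first))
```

Differences this small can still accumulate over many training steps, which is why two otherwise identical runs may diverge.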
deterministic_train controls whether Returnn should use deterministic ops as far as possible.
So far, this uses e.g. aggregation_method = tf.AggregationMethod.ADD_N instead of aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N for the TF optimizer.
We plan to extend this by replacing some of the non-deterministic ops with deterministic ones.
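The selection of the aggregation method could be sketched roughly as follows. This is not the actual Returnn code: the helper `select_aggregation_method_name` and the plain-dict config are hypothetical, and the sketch assumes that ADD_N is chosen when deterministic_train is set and EXPERIMENTAL_ACCUMULATE_N otherwise. It returns only the attribute name, so it runs without TensorFlow installed.

```python
def select_aggregation_method_name(config):
    """Hypothetical helper: pick the tf.AggregationMethod attribute name.

    `config` is a plain dict standing in for the real config object (assumption).
    """
    if config.get("deterministic_train", False):
        # Deterministic choice named in the text above.
        return "ADD_N"
    # Assumed choice when deterministic training is not requested.
    return "EXPERIMENTAL_ACCUMULATE_N"


# Example: a config that enables deterministic training.
config = {"deterministic_train": True}
method_name = select_aggregation_method_name(config)
print(method_name)  # ADD_N
```

With TensorFlow available, the name could be resolved via `getattr(tf.AggregationMethod, method_name)` and passed as the `aggregation_method` argument of a `tf.compat.v1.train.Optimizer`'s `compute_gradients` call.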