Deterministic training

There are a couple of TF operations which have a non-deterministic GPU implementation (for efficiency reasons), i.e. their result can differ between runs when executed on the GPU.

Non-deterministic ops:
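
One well-known example is tf.math.unsorted_segment_sum, whose GPU kernel accumulates via atomic adds, so the floating-point summation order (and thus the rounding) can vary between runs. A minimal sketch to observe this (assumes TF 2.x eager mode and an available GPU; the op choice is just for illustration):

```python
import numpy as np
import tensorflow as tf

# Fixed input, so any difference between runs comes from the GPU
# reduction order, not from the data.
values = tf.constant(np.random.RandomState(42).rand(1000000).astype("float32"))
indices = tf.zeros([1000000], dtype=tf.int32)  # everything goes into segment 0

with tf.device("/GPU:0"):
    a = tf.math.unsorted_segment_sum(values, indices, num_segments=1).numpy()
    b = tf.math.unsorted_segment_sum(values, indices, num_segments=1).numpy()

# On GPU, the two results can differ in the last bits.
print(a, b, np.array_equal(a, b))
```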

Note, however, that e.g. matmul is deterministic. From the cuBLAS documentation:

By design, all CUBLAS API routines from a given toolkit version generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit versions because the implementation might differ due to some implementation changes.
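
As a quick sanity check of this guarantee, one can run the same matmul twice and compare bitwise (a minimal sketch, under the same assumptions as above):

```python
import numpy as np
import tensorflow as tf

rng = np.random.RandomState(0)
a = tf.constant(rng.rand(512, 512).astype("float32"))
b = tf.constant(rng.rand(512, 512).astype("float32"))

with tf.device("/GPU:0"):
    c1 = tf.matmul(a, b).numpy()  # dispatched to cuBLAS on GPU
    c2 = tf.matmul(a, b).numpy()

# Same GPU architecture, same toolkit version: bitwise identical results.
print(np.array_equal(c1, c2))
```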

The option deterministic_train controls whether RETURNN should use deterministic ops as far as possible. Currently this means e.g. using aggregation_method = tf.AggregationMethod.ADD_N instead of aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N for the gradient aggregation in the TF optimizer. We plan to extend this by replacing some of the non-deterministic ops with deterministic ones.
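
In the config (a Python file), enabling this looks as follows (a minimal excerpt; deterministic_train is the option described above, the surrounding keys are just typical config context):

```python
# Excerpt of a RETURNN config (configs are Python files).
use_tensorflow = True
task = "train"
deterministic_train = True  # prefer deterministic ops where possible
```

The gradient-aggregation part corresponds to the aggregation_method argument of tf.gradients. A sketch of the idea (not RETURNN's actual code), shown with the TF1-style API via tf.compat.v1:

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
x = tf.get_variable("x", shape=[10], initializer=tf.ones_initializer())
loss = tf.reduce_sum(x * x)
# ADD_N sums all partial gradients at once in a fixed order;
# EXPERIMENTAL_ACCUMULATE_N accumulates them as they become ready,
# which saves memory but is not guaranteed to be deterministic.
(grad,) = tf.gradients(loss, [x], aggregation_method=tf.AggregationMethod.ADD_N)
```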