Deterministic training
There are a couple of TF operations which have a non-deterministic GPU implementation (for efficiency reasons), i.e. their result is non-deterministic when executed on the GPU. See also here.
Non-deterministic ops:

- reduce_mean, reduce_sum (see here). Or are they deterministic now? (see here)
- convolutional ops (via cuDNN) can be non-deterministic (see here)
- BiasAddGrad (see here)
- …
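To illustrate, here is a minimal probe (not RETURNN code; a sketch assuming the TF1-style session API) that runs the same GPU reduction twice on identical input and compares the results bit-wise:

```python
import numpy as np
import tensorflow as tf

# A large input makes order-dependent float rounding visible.
x = tf.constant(np.random.RandomState(42).rand(1000000).astype("float32"))
y = tf.reduce_sum(x)

with tf.Session() as session:
    a = session.run(y)
    b = session.run(y)
    # On GPU, a and b may differ in the last bits because the kernel
    # does not fix the summation order; on CPU they should match.
    print(a == b, abs(float(a) - float(b)))
```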
However, other ops, e.g. matmul, are deterministic. From the cuBLAS documentation:
By design, all CUBLAS API routines from a given toolkit version generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit versions because the implementation might differ due to some implementation changes.
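The same kind of probe as above (again a sketch assuming the TF1-style session API) should find matmul bit-wise reproducible on a fixed GPU and toolkit version:

```python
import numpy as np
import tensorflow as tf

rnd = np.random.RandomState(0)
a = tf.constant(rnd.rand(512, 512).astype("float32"))
b = tf.constant(rnd.rand(512, 512).astype("float32"))
c = tf.matmul(a, b)  # dispatched to cuBLAS on GPU

with tf.Session() as session:
    # Per the cuBLAS guarantee quoted above, both runs should yield
    # identical bits on the same GPU architecture and toolkit version.
    assert (session.run(c) == session.run(c)).all()
```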
The option deterministic_train controls whether RETURNN should use deterministic ops as far as possible.
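In the config (a RETURNN config is a Python file), enabling it looks as follows; a minimal sketch, assuming the usual flat option style:

```python
# Excerpt from a RETURNN config file.
deterministic_train = True
```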
So far, this makes the TF optimizer use aggregation_method = tf.AggregationMethod.ADD_N instead of aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N.
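A sketch of the underlying mechanism, using the TF1 gradient API (the toy loss and variable below are made up for illustration; only aggregation_method is the option in question):

```python
import tensorflow as tf

x = tf.Variable(3.0)
# Two paths contribute gradients to x, so TF must aggregate them.
loss = tf.square(x) + 2.0 * x

# ADD_N sums all contributions in one op with a fixed order;
# EXPERIMENTAL_ACCUMULATE_N accumulates them incrementally instead.
grads = tf.gradients(
    loss, [x], aggregation_method=tf.AggregationMethod.ADD_N)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    print(session.run(grads))  # [8.0] == 2*x + 2
```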
We plan to extend this by replacing some of the non-deterministic ops by deterministic ones.
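For illustration only (a known workaround in general, not necessarily what RETURNN will implement): a reduction can be made deterministic by routing it through matmul, which is bit-wise reproducible per the cuBLAS quote above. The function name below is hypothetical:

```python
import tensorflow as tf

def reduce_sum_deterministic(x):
    """Sum a 2-D tensor over its last axis via matmul with a ones
    vector, so the work goes through cuBLAS instead of a
    non-deterministic reduction kernel."""
    ones = tf.ones((tf.shape(x)[-1], 1), dtype=x.dtype)
    return tf.squeeze(tf.matmul(x, ones), axis=-1)
```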