NativeOp

Generic interface which automatically creates:

  * CPU and GPU op
  * inplace and not inplace
  * grad variants

class NativeOp.NativeOpBaseMixin(in_info, out_info, c_fw_code, c_bw_code=None, c_extra_support_code=None, code_version=None, cpu_support=True, grad_input_map=None, name=None)[source]

The purpose of having this as a separate base class is to make it independent of any Theano-specific functionality, so that we can also use this base for other frameworks, for example TensorFlow.

Parameters:
  • in_info (list[dict(str)]) –

    Each dict describes one input var. Attribs in the dict:

    int ndim: the number of dimensions.
    tuple shape: the shape tuple; can contain None for specific dimensions.

    Optional attribs:

    str dtype: “float32” by default.
    bool need_contiguous: False by default.
    int want_inplace: -1 by default. Try to optimize to destroy the input, on the given output index. “dummy_out” is a special value which will add another output.
    bool is_inplace: False by default. Whether the optimization was applied.
    str gradient: can be “disconnected”. See grad().
    bool bw_input: True by default. Add this param to the bw input.

    Other attribs are just ignored. (See the example sketch after this parameter list.)

  • out_info (list[dict(str)]) –

    Like in_info, with slightly different behavior for:

    shape: we also allow references to the in_info in the form (in-idx, dim). See infer_shape().
    need_contiguous/want_inplace: used for the bw op, in case bw_input == True.
  • c_fw_code (str) – C code for forward pass
  • c_extra_support_code (str|dict[str]) – C support code (for c_support_code)
  • c_bw_code (str|None) – C code for backward pass (for gradient)
  • code_version (tuple[int]) – will be returned by c_code_cache_version.
  • cpu_support (bool) –
  • grad_input_map (tuple[int]|callable) – selection of grad inputs. by default, we get all inputs + all outputs + all grad outputs.
  • name (str) – name
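
For illustration, a hypothetical in_info/out_info pair for a simple elementwise op could look like this sketch (the names and shapes are made up for this example and are not part of the module):

  # Hypothetical example: one 2d float32 input "x", one output "y" of the same shape.
  in_info = (
      {"name": "x", "ndim": 2, "shape": (None, None),
       "dtype": "float32", "need_contiguous": True},
  )
  out_info = (
      # shape refs of the form (in-idx, dim): take both dims from input 0
      {"name": "y", "ndim": 2, "shape": ((0, 0), (0, 1))},
  )
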
infer_shape(node, input_shapes)[source]
kwargs_for_grad_op()[source]
Returns: the kwargs for creating a NativeOp for the gradient op, e.g. includes in_info, out_info, etc.
Return type: dict[str]

Note: The inputs of the gradient are by default: fwd_op.inputs + fwd_op.outputs + output_grads. We filter them via self._filter_grad_inputs.

make_results_of_gradient(grad_op_outputs, disconnected_type=None)[source]
Parameters:
  • grad_op_outputs (list[T]) – this is already with dummy outputs removed
  • disconnected_type (S) –
Returns: gradient for each input of our op
Return type: list[T|S]

class NativeOp.NativeOp(custom_grad=None, **kwargs)[source]

We wrap some C code which can define a forward pass and optionally a backward pass (for gradient calculation). The C code should be Numpy and CUDA compatible. See NativeOp.cpp. We also support inplace operations, i.e. we can operate inplace on some inputs. You can define in a flexible way all the inputs and the outputs. See __init__() for the details.

All output variables are created automatically with the right shape, but their content is not initialized, except when they are used by some input variable as the inplace output; in that case, the output is either the input variable itself or holds a copy of its data.
Parameters:
  • custom_grad (function) – if given, will use this instead for self.grad
  • kwargs (dict[str]) – all passed to NativeOpBaseMixin
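
As a rough usage sketch (not a tested example), such an op could be constructed by passing in_info/out_info together with a C forward snippet that follows the inputs/outputs calling convention visible in the c_fw_code attributes further below:

  # A minimal sketch, assuming the hypothetical in_info/out_info from above.
  # The C snippet only copies the (contiguous) 2d input to the output.
  copy_fw_code = """
      assert(n_inputs == 1);
      assert(n_outputs == 1);
      Ndarray* x = inputs[0];
      Ndarray* y = *outputs[0];
      Ndarray_memcpy(
          Ndarray_DEV_DATA(y), Ndarray_DEV_DATA(x),
          Ndarray_DIMS(x)[0] * Ndarray_DIMS(x)[1] * sizeof(float));
  """

  op = NativeOp(in_info=in_info, out_info=out_info,
                c_fw_code=copy_fw_code, name="copy_example")
  # y = op(x)  # x: a 2d float32 Theano variable
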
classmethod as_tensor_var(v)[source]
classmethod tensor_type(dtype, ndim)[source]
classmethod contiguous(v)[source]
grad(inputs, output_grads)[source]
connection_pattern(node)[source]
make_node(*args)[source]
perform(node, inputs, output_storage)[source]
c_code_cache_version()[source]
c_support_code()[source]
c_libraries()[source]
c_compile_args()[source]
c_lib_dirs()[source]
c_header_dirs()[source]
c_code(node, name, inputs, outputs, sub)[source]
class NativeOp.GpuNativeOp(custom_grad=None, **kwargs)[source]
Parameters:
  • custom_grad (function) – if given, will use this instead for self.grad
  • kwargs (dict[str]) – all passed to NativeOpBaseMixin
classmethod as_tensor_var(v)[source]
classmethod tensor_type(dtype, ndim)[source]
classmethod contiguous(v)[source]
c_support_code()[source]
class NativeOp.NativeOpGenBase[source]

Base interface for op generation. See NativeOp.__init__() for attribs.

in_info = None[source]
out_info = None[source]
c_fw_code = None[source]
c_bw_code = None[source]
c_extra_support_code = None[source]
code_version = None[source]
grad_input_map = None[source]
custom_grad = None[source]
cpu_support = True[source]
make_op()[source]
classmethod map_layer_inputs_to_op(*inputs)[source]
classmethod map_layer_output_from_op(*outputs)[source]
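
A sketch of how a generator class might subclass NativeOpGenBase; the concrete classes below (LstmGenericBase, Chunking, etc.) follow this pattern with real kernels. The class and its kernel here are hypothetical and only illustrate the attribute layout:

  class CopyExample(NativeOpGenBase):  # hypothetical, for illustration only
      in_info = ({"name": "x", "ndim": 2, "shape": (None, None), "need_contiguous": True},)
      out_info = ({"name": "y", "ndim": 2, "shape": ((0, 0), (0, 1))},)
      c_fw_code = """
          assert(n_inputs == 1); assert(n_outputs == 1);
          Ndarray* x = inputs[0];
          Ndarray* y = *outputs[0];
          Ndarray_memcpy(Ndarray_DEV_DATA(y), Ndarray_DEV_DATA(x),
                         Ndarray_DIMS(x)[0] * Ndarray_DIMS(x)[1] * sizeof(float));
      """
      code_version = ()

  # CopyExample().make_op() would then build the actual op instance.
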
class NativeOp.LstmGenericBase[source]
inputs:
  param Z: {input,output,forget} gate + cell state. 3d (time,batch,dim*4)
  param V_h: recurrent matrix. 2d (dim,dim*4)
  param c: initial cell state. 2d (batch,dim)
  param i: index. 2d (time,batch) -> 0 or 1
outputs:
  param Y: output. 3d (time,batch,dim)
  param H: gates and cell state. 3d (time,batch,dim*4)
  param d: final cell state. 2d (batch,dim)
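
To make the shapes concrete, a small numpy sketch of well-formed inputs (sizes are arbitrary and only illustrative):

  import numpy

  n_time, n_batch, dim = 5, 2, 3
  Z = numpy.random.randn(n_time, n_batch, dim * 4).astype("float32")  # gates + cell input
  V_h = numpy.random.randn(dim, dim * 4).astype("float32")            # recurrent matrix
  c = numpy.zeros((n_batch, dim), dtype="float32")                    # initial cell state
  i = numpy.ones((n_time, n_batch), dtype="float32")                  # index/mask of 0 or 1
  # Expected outputs: Y (n_time, n_batch, dim), H (n_time, n_batch, dim*4), d (n_batch, dim)
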
in_info = ({'ndim': 3, 'bw_out_var': {'shape': ((2, 0), (2, 1), (0, 1))}, 'name': 'Z', 'need_contiguous': True, 'shape': (None, None, None), 'want_inplace': 1}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'V_h'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'c'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 'i'})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (1, 0)), 'need_contiguous': True, 'name': 'Y', 'bw_grad_var': {'want_inplace': 'dummy_out'}}, {'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'need_contiguous': True, 'name': 'H', 'bw_in_var': {'want_inplace': 0}}, {'ndim': 2, 'shape': ((2, 0), (2, 1)), 'need_contiguous': True, 'name': 'd'})[source]
classmethod grad_input_map(Z, V_h, c, i, Y, H, d, DY, DH, Dd)[source]
classmethod map_layer_inputs_to_op(Z, V_h, i)[source]
c_extra_support_code = {'lstm_kernel': '\n DEF_KERNEL\n void lstm_kernel(float* data, const float* old_state, bool old_state_strided,\n float* output, float* state_out, int n_cells, int n_batch, const float* i) {\n //layout:\n //data[0*n_cells..1*n_cells-1] : input gate\n //data[1*n_cells..2*n_cells-1] : forget gate\n //data[2*n_cells..3*n_cells-1] : output gate\n //data[3*n_cells..4*n_cells-1] : cell state\n //output[0*n_cells..1*n_cells-1]: cell output\n //repeated for every mini-batch\n\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_cells * n_batch) {\n int batch_idx = idx / n_cells;\n int start = batch_idx * 4 * n_cells + idx % n_cells;\n float i_batch = i[batch_idx];\n\n //input, forget and output gates\n float inpGate = 1.f / (1.f + expf(-data[start + n_cells]));\n float fgtGate = 1.f / (1.f + expf(-data[start + 2 * n_cells]));\n float outGate = 1.f / (1.f + expf(-data[start + 3 * n_cells]));\n float state = inpGate * tanhf(data[start]);\n float old_state_batch = old_state_strided ? old_state[start] : old_state[idx];\n\n state += fgtGate * old_state_batch;\n state = state * i_batch + old_state_batch * (1.f - i_batch);\n\n //cell output\n output[idx] = outGate * tanhf(state) * i_batch;\n\n data[start] = state;\n data[start + n_cells] = inpGate;\n data[start + 2 * n_cells] = fgtGate;\n data[start + 3 * n_cells] = outGate;\n if(state_out)\n state_out[idx] = state;\n\n idx += gridDim.x * blockDim.x;\n }\n }\n ', 'lstm_bwd_kernel': '\n DEF_KERNEL\n void lstm_bwd_kernel(\n float* delta, float* epsilon, const float* next_epsilon, const float* old_state,\n bool old_state_strided, const float* Y, int n_cells, int n_batch, const float* i) {\n //layout:\n //delta[0*n_cells..1*n_cells-1] : input gate\n //delta[1*n_cells..2*n_cells-1] : forget gate\n //delta[2*n_cells..3*n_cells-1] : output gate\n //delta[3*n_cells..4*n_cells-1] : cell state\n //epsilon[0*n_cells..1*n_cells-1]: cell output derivative (later overwritten, see below)\n //next_epsilon[0*n_cells..1*n_cells-1]: cell state derivative * forget_gate (of next timestep)\n //repeated for every mini-batch\n\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_cells * n_batch) {\n int batch_idx = idx / n_cells;\n int batch_offset = batch_idx * 4 * n_cells;\n int cell_offset = idx % n_cells;\n int start = batch_offset + cell_offset;\n float i_batch = i[batch_idx];\n\n float inpGate = delta[start + n_cells];\n float fgtGate = delta[start + 2 * n_cells];\n float outGate = delta[start + 3 * n_cells];\n float oldState = old_state_strided ? old_state[start] : old_state[idx];\n float state = delta[start];\n float eps = epsilon[idx];\n\n //avoid division by 0\n float gc = tanhf(state); //g(c(t))\n float gzc = (state - fgtGate * oldState) / fmaxf(inpGate, float(1e-16)); //g(z_c(t))\n\n //delta_output\n delta[start + 3 * n_cells] = outGate * (1.f - outGate) * gc * eps * i_batch;\n\n //epsilon_c\n float epsilon_c = (1.f - (gc * gc)) * outGate * eps;\n epsilon_c += next_epsilon[idx];\n epsilon[idx] = epsilon_c * fgtGate * i_batch + next_epsilon[idx] * (1.f - i_batch);\n\n //delta_cell\n delta[start] = inpGate * (1.f - (gzc * gzc)) * epsilon_c * i_batch;\n\n //delta_forget\n delta[start + 2 * n_cells] = fgtGate * (1.f - fgtGate) * oldState * epsilon_c * i_batch;\n\n //delta_input\n delta[start + n_cells] = inpGate * (1.f - inpGate) * gzc * epsilon_c * i_batch;\n\n idx += gridDim.x * blockDim.x;\n }\n }\n '}[source]
c_fw_code = '\n // Z*, V_h, c, i = input_names (*: inplace)\n // Y, H, d = output_names\n assert(n_inputs == 4);\n assert(n_outputs == 3);\n Ndarray* V_h = inputs[1];\n Ndarray* c = inputs[2];\n Ndarray* i = inputs[3];\n Ndarray* Y = *outputs[0];\n Ndarray* H = *outputs[1]; // inplace on Z\n Ndarray* d = *outputs[2];\n\n long T = Ndarray_DIMS(i)[0];\n int n_batch = Ndarray_DIMS(i)[1];\n assert(Ndarray_DIMS(H)[2] %% 4 == 0); // 3 gates + cell\n int n_cells = Ndarray_DIMS(H)[2] / 4;\n\n assert(T > 0);\n for(int x = 0; x < T; ++x) {\n if(x > 0) {\n //H += Y[x-1]*V_h\n affine_y_x(x-1, Y, x, V_h, x, H);\n }\n\n start_dev_kernel(lstm_kernel, (\n data_ptr(H, x),\n x > 0 ? data_ptr(H, x - 1) : Ndarray_DEV_DATA(c),\n x > 0,\n data_ptr(Y, x),\n (x == T - 1) ? Ndarray_DEV_DATA(d) : 0,\n n_cells,\n n_batch,\n Ndarray_DEV_DATA(i) + x * n_batch\n ));\n }\n '[source]
c_bw_code = '\n // V_h, c, i, Y, H*, DY*, Dd = input_names (*: inplace)\n // DZ, DV_h, Dc, tmpDc = output_names\n assert(n_inputs == 7);\n assert(n_outputs == 4);\n Ndarray* V_h = inputs[0];\n Ndarray* c = inputs[1];\n Ndarray* i = inputs[2];\n Ndarray* Y = inputs[3];\n Ndarray* Dd = inputs[6];\n Ndarray* DZ = *outputs[0]; // inplace on H\n Ndarray* DV_h = *outputs[1];\n Ndarray* Dc = *outputs[2];\n Ndarray* tmpDc = *outputs[3]; // (old DY), inplace buffer\n\n long T = Ndarray_DIMS(i)[0];\n int n_batch = Ndarray_DIMS(i)[1];\n assert(Ndarray_DIMS(DZ)[2] %% 4 == 0); // 3 gates + cell\n int n_cells = Ndarray_DIMS(DZ)[2] / 4;\n\n assert(T > 0);\n for(int x = T - 1; x >= 0; --x) {\n // add recurrent\n bool rightBorder = (x == T - 1);\n if(!rightBorder)\n affine_y_x(x+1, DZ, x, V_h, x, tmpDc, false, true);\n\n start_dev_kernel(lstm_bwd_kernel, (\n data_ptr(DZ, x),\n data_ptr(tmpDc, x),\n rightBorder ? Ndarray_DEV_DATA(Dd) : data_ptr(tmpDc, x + 1),\n x > 0 ? data_ptr(DZ, x - 1) : Ndarray_DEV_DATA(c),\n x > 0,\n data_ptr(Y, x),\n n_cells,\n n_batch,\n Ndarray_DEV_DATA(i) + x * n_batch\n ));\n }\n\n //DV_h = Y[0..end-1]^T * DZ[1..end]\n affine_global(Y, DZ, DV_h, true, false, 1, 0.0f);\n\n const Ndarray_DIM_Type* Dc_dim = Ndarray_HOST_DIMS(Dc);\n Ndarray_memcpy(\n Ndarray_DEV_DATA(Dc), Ndarray_DEV_DATA(tmpDc),\n Dc_dim[0] * Dc_dim[1] * sizeof(float));\n\n '[source]
code_version = ()[source]
class NativeOp.LstmLowMem[source]
inputs:
  param X: (time,batch,in_dim)
  param W: forward+recurrent matrix. 2d (in_dim+dim,dim*4)
  param b: bias. 1d (dim*4,)
  param y0: initial output|hidden state. 2d (batch,dim)
  param c0: initial cell state. 2d (batch,dim)
  param i: index. 2d (time,batch) -> 0 or 1
  param start: where to start. must be >=0, default is usually 0. dtype int, scalar.
  param step: +1 for fwd, -1 for bwd direction. can also be |step|>1 for wider steps. dtype int, scalar. for bwd (<0), will start at T-start-1.
outputs:
  param Y: output. 3d (time,batch,dim)
  param C: cell states. 3d (time,batch,dim). gradient ignored!
  param d: final cell state. 2d (batch,dim)
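
Again a shape sketch only (hypothetical sizes), showing the expected input layout including the scalar start/step:

  import numpy

  n_time, n_batch, n_in, dim = 5, 2, 4, 3
  X = numpy.random.randn(n_time, n_batch, n_in).astype("float32")
  W = numpy.random.randn(n_in + dim, dim * 4).astype("float32")  # forward + recurrent matrix
  b = numpy.zeros((dim * 4,), dtype="float32")
  y0 = numpy.zeros((n_batch, dim), dtype="float32")
  c0 = numpy.zeros((n_batch, dim), dtype="float32")
  i = numpy.ones((n_time, n_batch), dtype="float32")
  start = numpy.int32(0)  # must be >= 0
  step = numpy.int32(1)   # +1 fwd, -1 bwd; |step| > 1 for wider steps
  # Expected outputs: Y (n_time, n_batch, dim), C (n_time, n_batch, dim), d (n_batch, dim)
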
in_info = ({'ndim': 3, 'shape': (None, None, None), 'need_contiguous': True, 'name': 'X'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'W'}, {'ndim': 1, 'shape': (None,), 'need_contiguous': True, 'name': 'b'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'y0'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'c0'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 'i'}, {'ndim': 0, 'name': 'start', 'gradient': 'disconnected', 'dtype': 'int32', 'host_memory': True, 'shape': ()}, {'ndim': 0, 'name': 'step', 'gradient': 'disconnected', 'dtype': 'int32', 'host_memory': True, 'shape': ()})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (4, 1)), 'need_contiguous': True, 'name': 'Y'}, {'ndim': 3, 'shape': ((0, 0), (0, 1), (4, 1)), 'need_contiguous': True, 'name': 'C'}, {'ndim': 2, 'shape': ((0, 1), (4, 1)), 'need_contiguous': True, 'name': 'd'})[source]
classmethod grad_input_map(X, W, b, y0, c0, i, start, step, Y, C, d, DY, DC, Dd)[source]
c_extra_support_code = {'lstm_kernel': '\n DEF_KERNEL\n void lstm_kernel(\n int n_batch, int n_cells, const float* mask,\n float* intern,\n float* prev_c,\n float* y,\n float* c)\n {\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_cells * n_batch) {\n int batch_idx = idx / n_cells;\n int cell_idx = idx % n_cells;\n int intern_offset = batch_idx * 4 * n_cells + cell_idx;\n float prev_c_b = prev_c[idx];\n float mask_b = mask[batch_idx];\n\n // cell-in + input, forget and output gates\n float cellIn = tanhf(intern[intern_offset]);\n float inpGate = 1.f / (1.f + expf(-intern[intern_offset + n_cells]));\n float fgtGate = 1.f / (1.f + expf(-intern[intern_offset + 2 * n_cells]));\n float outGate = 1.f / (1.f + expf(-intern[intern_offset + 3 * n_cells]));\n\n float c_b = (prev_c_b * fgtGate + cellIn * inpGate) * mask_b\n + prev_c_b * (1.f - mask_b);\n c[idx] = c_b;\n y[idx] = tanhf(c_b) * outGate * mask_b;\n\n idx += gridDim.x * blockDim.x;\n }\n }\n ', 'copy_x_h_kernel': '\n DEF_KERNEL\n void copy_x_h_kernel(\n int n_batch, int n_in, int n_cells,\n float* x_h,\n float* x,\n float* h)\n {\n int n_total_in = n_in + n_cells;\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_batch * n_total_in) {\n int batch_idx = idx / n_total_in;\n int in_dim_idx = idx % n_total_in;\n\n if(in_dim_idx < n_in)\n x_h[idx] = x[batch_idx * n_in + in_dim_idx];\n else\n x_h[idx] = h[batch_idx * n_cells + in_dim_idx - n_in];\n\n idx += gridDim.x * blockDim.x;\n }\n }\n ', 'lstm_bwd_kernel': '\n DEF_KERNEL\n void lstm_bwd_kernel(\n int n_batch, int n_in, int n_cells, const float* mask,\n float* x_h,\n float* intern,\n float* prev_c,\n float* y,\n float* c,\n float* d_y,\n float* d_h,\n float* d_c,\n float* d_intern,\n float* d_b)\n {\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_cells * n_batch) {\n int batch_idx = idx / n_cells;\n int cell_idx = idx % n_cells;\n int intern_offset = batch_idx * 4 * n_cells + cell_idx;\n float mask_b = mask[batch_idx];\n float d_y_b = d_y[idx] * mask_b + d_h[idx];\n float d_c_b = d_c[idx] * mask_b;\n float prev_c_b = prev_c[idx];\n\n // cell-in + input, forget and output gates\n float cellIn = tanhf(intern[intern_offset]);\n float inpGate = 1.f / (1.f + expf(-intern[intern_offset + n_cells]));\n float fgtGate = 1.f / (1.f + expf(-intern[intern_offset + 2 * n_cells]));\n float outGate = 1.f / (1.f + expf(-intern[intern_offset + 3 * n_cells]));\n\n float c_b = prev_c_b * fgtGate + cellIn * inpGate;\n float gc = tanhf(c_b);\n float d_outGate_in = (1.f - outGate) * outGate * gc * d_y_b;\n float d_c2 = d_c_b + outGate * d_y_b * (1.f - gc * gc);\n float d_cellIn_in = (1.f - cellIn * cellIn) * inpGate * d_c2;\n float d_inpGate_in = (1.f - inpGate) * inpGate * cellIn * d_c2;\n float d_fgtGate_in = (1.f - fgtGate) * fgtGate * prev_c_b * d_c2;\n d_c[idx] = fgtGate * d_c2 + d_c[idx] * (1.f - mask_b);\n\n d_intern[intern_offset] = d_cellIn_in;\n d_intern[intern_offset + n_cells] = d_inpGate_in;\n d_intern[intern_offset + 2 * n_cells] = d_fgtGate_in;\n d_intern[intern_offset + 3 * n_cells] = d_outGate_in;\n\n elem_atomic_add(&d_b[cell_idx], d_cellIn_in);\n elem_atomic_add(&d_b[cell_idx + n_cells], d_inpGate_in);\n elem_atomic_add(&d_b[cell_idx + 2 * n_cells], d_fgtGate_in);\n elem_atomic_add(&d_b[cell_idx + 3 * n_cells], d_outGate_in);\n\n idx += gridDim.x * blockDim.x;\n }\n }\n ', 'inv_copy_x_h_kernel': '\n DEF_KERNEL\n void inv_copy_x_h_kernel(\n int n_batch, int n_in, int n_cells,\n float* x_h,\n float* x,\n float* h)\n {\n int n_total_in = 
n_in + n_cells;\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_batch * n_total_in) {\n int batch_idx = idx / n_total_in;\n int in_dim_idx = idx % n_total_in;\n\n if(in_dim_idx < n_in)\n x[batch_idx * n_in + in_dim_idx] = x_h[idx];\n else\n h[batch_idx * n_cells + in_dim_idx - n_in] = x_h[idx];\n\n idx += gridDim.x * blockDim.x;\n }\n }\n ', 'add_bias_kernel': '\n DEF_KERNEL\n void add_bias_kernel(int n_batch, int n_dim, float* x, float* b) {\n int idx = threadIdx.x + blockDim.x * blockIdx.x;\n while (idx < n_batch * n_dim) {\n int dim_idx = idx % n_dim;\n x[idx] += b[dim_idx];\n idx += gridDim.x * blockDim.x;\n }\n }\n '}[source]
c_fw_code = '\n // X, W, b, y0, c0, i, start, step = input_names\n // Y, C, d = output_names\n assert(n_inputs == 8);\n assert(n_outputs == 3);\n Ndarray* X = inputs[0];\n Ndarray* W = inputs[1];\n Ndarray* b = inputs[2];\n Ndarray* y0 = inputs[3];\n Ndarray* c0 = inputs[4];\n Ndarray* i = inputs[5];\n assert_cmp(Ndarray_NDIM(inputs[6]), ==, 0);\n assert_cmp(Ndarray_NDIM(inputs[7]), ==, 0);\n int start = Ndarray_DEV_DATA_int32_scalar(inputs[6]);\n int step = Ndarray_DEV_DATA_int32_scalar(inputs[7]);\n Ndarray* Y = *outputs[0];\n Ndarray* C = *outputs[1];\n Ndarray* d = *outputs[2];\n\n assert_cmp(Ndarray_NDIM(X), ==, 3);\n assert_cmp(Ndarray_NDIM(W), ==, 2);\n assert_cmp(Ndarray_NDIM(b), ==, 1);\n assert_cmp(Ndarray_NDIM(y0), ==, 2);\n assert_cmp(Ndarray_NDIM(c0), ==, 2);\n assert_cmp(Ndarray_NDIM(i), ==, 2);\n assert_cmp(Ndarray_NDIM(Y), ==, 3);\n assert_cmp(Ndarray_NDIM(C), ==, 3);\n assert_cmp(Ndarray_NDIM(d), ==, 2);\n long T = Ndarray_DIMS(i)[0];\n int n_batch = Ndarray_DIMS(i)[1];\n int n_cells = Ndarray_DIMS(y0)[1];\n int n_in = Ndarray_DIMS(X)[2];\n assert_cmp(Ndarray_DIMS(X)[0], ==, T);\n assert_cmp(Ndarray_DIMS(X)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(W)[0], ==, n_in + n_cells);\n assert_cmp(Ndarray_DIMS(W)[1], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(b)[0], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(y0)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(y0)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(c0)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(c0)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(Y)[0], ==, T);\n assert_cmp(Ndarray_DIMS(Y)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(Y)[2], ==, n_cells);\n assert_cmp(Ndarray_DIMS(C)[0], ==, T);\n assert_cmp(Ndarray_DIMS(C)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(C)[2], ==, n_cells);\n assert_cmp(Ndarray_DIMS(d)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(d)[1], ==, n_cells);\n\n float* x_h = (float*) device_malloc(n_batch * (n_in + n_cells) * sizeof(float));\n float* intern = (float*) device_malloc(n_batch * n_cells * 4 * sizeof(float)); // 3 gates + in\n\n assert_cmp(T, >, 0);\n assert_cmp(start, >=, 0);\n assert_cmp(start, <, T);\n assert_cmp(step, !=, 0);\n int end = T - 1;\n if(step < 0) {\n end = start;\n start = T - start - 1;\n }\n int t = start;\n for(; (step > 0) ? (t <= end) : (t >= end); t += step) {\n // x_h = X[t], Y[t-1]\n start_dev_kernel(copy_x_h_kernel,\n (n_batch, n_in, n_cells, x_h, data_ptr(X, t), (t != start) ? data_ptr(Y, t-step) : Ndarray_DEV_DATA(y0)));\n // intern = x_h * W\n affine_raw(\n x_h, n_batch, n_in + n_cells,\n Ndarray_DEV_DATA(W), n_in + n_cells, n_cells * 4,\n intern, n_batch, n_cells * 4,\n false, false, 0.0);\n // intern += b\n start_dev_kernel(add_bias_kernel, (\n n_batch, n_cells * 4, intern, Ndarray_DEV_DATA(b)));\n\n start_dev_kernel(lstm_kernel, (\n n_batch,\n n_cells,\n Ndarray_DEV_DATA(i) + t * n_batch,\n intern,\n (t != start) ? data_ptr(C, t-step) : Ndarray_DEV_DATA(c0),\n data_ptr(Y, t), // out\n data_ptr(C, t) // out\n ));\n }\n\n device_free(x_h);\n device_free(intern);\n\n Ndarray_memcpy(Ndarray_DEV_DATA(d), data_ptr(C, t - step), n_batch * n_cells * sizeof(float));\n '[source]
c_bw_code = '\n // X, W, b, y0, c0, i, start, step, Y, C, DY, Dd = input_names\n // DX, DW, Db, Dh, Dc = output_names\n assert(n_inputs == 12);\n assert(n_outputs == 5);\n Ndarray* X = inputs[0];\n Ndarray* W = inputs[1];\n Ndarray* b = inputs[2];\n Ndarray* y0 = inputs[3];\n Ndarray* c0 = inputs[4];\n Ndarray* i = inputs[5];\n assert_cmp(Ndarray_NDIM(inputs[6]), ==, 0);\n assert_cmp(Ndarray_NDIM(inputs[7]), ==, 0);\n int start = Ndarray_DEV_DATA_int32_scalar(inputs[6]);\n int step = Ndarray_DEV_DATA_int32_scalar(inputs[7]);\n Ndarray* Y = inputs[8];\n Ndarray* C = inputs[9];\n Ndarray* DY = inputs[10];\n Ndarray* Dd = inputs[11];\n Ndarray* DX = *outputs[0];\n Ndarray* DW = *outputs[1];\n Ndarray* Db = *outputs[2];\n Ndarray* Dh = *outputs[3];\n Ndarray* Dc = *outputs[4];\n\n assert_cmp(Ndarray_NDIM(X), ==, 3);\n assert_cmp(Ndarray_NDIM(W), ==, 2);\n assert_cmp(Ndarray_NDIM(b), ==, 1);\n assert_cmp(Ndarray_NDIM(y0), ==, 2);\n assert_cmp(Ndarray_NDIM(c0), ==, 2);\n assert_cmp(Ndarray_NDIM(i), ==, 2);\n assert_cmp(Ndarray_NDIM(Y), ==, 3);\n assert_cmp(Ndarray_NDIM(C), ==, 3);\n assert_cmp(Ndarray_NDIM(DY), ==, 3);\n assert_cmp(Ndarray_NDIM(Dd), ==, 2);\n assert_cmp(Ndarray_NDIM(DX), ==, 3);\n assert_cmp(Ndarray_NDIM(DW), ==, 2);\n assert_cmp(Ndarray_NDIM(Db), ==, 1);\n assert_cmp(Ndarray_NDIM(Dh), ==, 2);\n assert_cmp(Ndarray_NDIM(Dc), ==, 2);\n long T = Ndarray_DIMS(i)[0];\n int n_batch = Ndarray_DIMS(i)[1];\n int n_cells = Ndarray_DIMS(y0)[1];\n int n_in = Ndarray_DIMS(X)[2];\n assert_cmp(Ndarray_DIMS(X)[0], ==, T);\n assert_cmp(Ndarray_DIMS(X)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(W)[0], ==, n_in + n_cells);\n assert_cmp(Ndarray_DIMS(W)[1], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(b)[0], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(y0)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(y0)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(c0)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(c0)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(Y)[0], ==, T);\n assert_cmp(Ndarray_DIMS(Y)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(Y)[2], ==, n_cells);\n assert_cmp(Ndarray_DIMS(C)[0], ==, T);\n assert_cmp(Ndarray_DIMS(C)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(C)[2], ==, n_cells);\n assert_cmp(Ndarray_DIMS(DY)[0], ==, T);\n assert_cmp(Ndarray_DIMS(DY)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(DY)[2], ==, n_cells);\n assert_cmp(Ndarray_DIMS(Dd)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(Dd)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(DX)[0], ==, T);\n assert_cmp(Ndarray_DIMS(DX)[1], ==, n_batch);\n assert_cmp(Ndarray_DIMS(DX)[2], ==, n_in);\n assert_cmp(Ndarray_DIMS(DW)[0], ==, n_in + n_cells);\n assert_cmp(Ndarray_DIMS(DW)[1], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(Db)[0], ==, n_cells * 4);\n assert_cmp(Ndarray_DIMS(Dh)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(Dh)[1], ==, n_cells);\n assert_cmp(Ndarray_DIMS(Dc)[0], ==, n_batch);\n assert_cmp(Ndarray_DIMS(Dc)[1], ==, n_cells);\n\n float* x_h = (float*) device_malloc(n_batch * (n_in + n_cells) * sizeof(float));\n float* intern = (float*) device_malloc(n_batch * n_cells * 4 * sizeof(float)); // 3 gates + in\n float* Dx_h = (float*) device_malloc(n_batch * (n_in + n_cells) * sizeof(float));\n float* Dintern = (float*) device_malloc(n_batch * n_cells * 4 * sizeof(float)); // 3 gates + in\n\n // We will work inplace on DX/DW/Db.\n Ndarray_memset(Ndarray_DEV_DATA(DX), 0, T * n_batch * n_in * sizeof(float));\n Ndarray_memset(Ndarray_DEV_DATA(DW), 0, (n_in + n_cells) * n_cells * 4 * sizeof(float));\n Ndarray_memset(Ndarray_DEV_DATA(Db), 0, n_cells * 
4 * sizeof(float));\n // We will work inplace on Dh.\n Ndarray_memset(Ndarray_DEV_DATA(Dh), 0, n_batch * n_cells * sizeof(float));\n // We will work inplace on Dc, and init it with Dd.\n Ndarray_memcpy(Ndarray_DEV_DATA(Dc), Ndarray_DEV_DATA(Dd), n_batch * n_cells * sizeof(float));\n\n assert_cmp(T, >, 0);\n assert_cmp(start, >=, 0);\n assert_cmp(start, <, T);\n assert_cmp(step, !=, 0);\n int end = T - 1;\n if(step < 0) {\n end = start;\n start = T - start - 1;\n }\n int t = end; // go backwards\n for(; (step > 0) ? (t >= start) : (t <= start); t -= step) {\n bool right = (step > 0) ? (t - step >= start) : (t - step <= start);\n\n // TODO: correct handling of mask in grad, fwd, initial cell,hidden, etc\n // x_h = X[t], Y[t-1]\n start_dev_kernel(copy_x_h_kernel,\n (n_batch, n_in, n_cells,\n x_h, data_ptr(X, t), right ? data_ptr(Y, t-step) : Ndarray_DEV_DATA(y0)));\n\n // intern = x_h * W\n affine_raw(\n x_h, n_batch, n_in + n_cells,\n Ndarray_DEV_DATA(W), n_in + n_cells, n_cells * 4,\n intern, n_batch, n_cells * 4,\n false, false, 0.0);\n // intern += b\n start_dev_kernel(add_bias_kernel, (\n n_batch, n_cells * 4, intern, Ndarray_DEV_DATA(b)));\n\n start_dev_kernel(lstm_bwd_kernel, (\n n_batch,\n n_in,\n n_cells,\n Ndarray_DEV_DATA(i) + t * n_batch,\n x_h,\n intern,\n right ? data_ptr(C, t-step) : Ndarray_DEV_DATA(c0),\n data_ptr(Y, t),\n data_ptr(C, t),\n data_ptr(DY, t),\n Ndarray_DEV_DATA(Dh), // error from prev frame, excluding DY. updated below\n Ndarray_DEV_DATA(Dc), // in+out, working inplace. also error from prev frame, initially Dd\n Dintern, // out\n Ndarray_DEV_DATA(Db) // out\n ));\n\n // Dx_h = Dintern * W^T\n affine_raw(\n Dintern, n_batch, n_cells * 4,\n Ndarray_DEV_DATA(W), n_in + n_cells, n_cells * 4,\n Dx_h, n_batch, n_in + n_cells,\n false, true, 0.0);\n\n // DW += x_h^T * Dintern\n affine_raw(\n x_h, n_batch, n_in + n_cells,\n Dintern, n_batch, n_cells * 4,\n Ndarray_DEV_DATA(DW), n_in + n_cells, n_cells * 4,\n true, false);\n\n // DX[t], Dh = Dx_h\n start_dev_kernel(inv_copy_x_h_kernel,\n (n_batch, n_in, n_cells, Dx_h, data_ptr(DX, t), Ndarray_DEV_DATA(Dh)));\n }\n\n device_free(x_h);\n device_free(intern);\n device_free(Dx_h);\n device_free(Dintern);\n '[source]
class NativeOp.Chunking[source]

Given an input in 3d (n_time,n_batch,n_dim), we chunk up the time dimension in chunks of size chunk_size, every chunk_step frames. This results in a 3d output (chunk_size, n_batch * n_chunks, n_dim) where n_chunks = floor(max(n_time - chunk_size + chunk_step - 1, 0) / chunk_step) + 1. Examples:

  n_time=1,   chunk_size=50, chunk_step=10 -> n_chunks=1
  n_time=49,  chunk_size=50, chunk_step=10 -> n_chunks=1
  n_time=50,  chunk_size=50, chunk_step=10 -> n_chunks=1
  n_time=51,  chunk_size=50, chunk_step=10 -> n_chunks=2
  n_time=60,  chunk_size=50, chunk_step=10 -> n_chunks=2
  n_time=61,  chunk_size=50, chunk_step=10 -> n_chunks=3
  n_time=99,  chunk_size=50, chunk_step=10 -> n_chunks=6
  n_time=100, chunk_size=50, chunk_step=10 -> n_chunks=6
  n_time=101, chunk_size=50, chunk_step=10 -> n_chunks=7
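
The same formula as plain Python, reproducing the examples above (a hypothetical helper, not part of the module):

  def n_chunks(n_time, chunk_size, chunk_step):
      # floor(max(n_time - chunk_size + chunk_step - 1, 0) / chunk_step) + 1
      return max(n_time - chunk_size + chunk_step - 1, 0) // chunk_step + 1

  assert n_chunks(50, 50, 10) == 1
  assert n_chunks(51, 50, 10) == 2
  assert n_chunks(101, 50, 10) == 7
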
in_info = ({'ndim': 3, 'shape': (None, None, None), 'name': 'input'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'index'}, {'ndim': 3, 'gradient': 'disconnected', 'shape': (None, None, None), 'name': 'output_buffer', 'want_inplace': 0}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'oindex_buffer', 'want_inplace': 1}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (2,), 'need_contiguous': True, 'name': 'chunk_params'})[source]
out_info = ({'ndim': 3, 'shape': ((2, 0), (2, 1), (2, 2)), 'name': 'output'}, {'ndim': 2, 'shape': ((3, 0), (3, 1)), 'name': 'oindex'})[source]
c_extra_support_code = {'copy_kernel': '\n DEF_KERNEL\n void copy_kernel(\n float* chunk_params,\n float* input, long in_dim0, long in_dim1, long in_dim2, long in_stride0, long in_stride1, long in_stride2,\n float* index, long idx_stride0, long idx_stride1,\n float* output, long out_dim0, long out_dim1, long out_stride0, long out_stride1, long out_stride2,\n float* oindex, long oidx_stride0, long oidx_stride1\n ) {\n assert_cmp(out_dim1 % in_dim1, ==, 0);\n const long n_chunks = out_dim1 / in_dim1;\n assert_cmp(n_chunks, >, 0);\n const long chunk_size = out_dim0;\n assert_cmp(long(chunk_params[0]), ==, chunk_size);\n const long chunk_step = long(chunk_params[1]);\n assert_cmp(chunk_step, >, 0);\n assert_cmp(chunk_step * (n_chunks - 1) + chunk_size, >=, in_dim0);\n assert_cmp(chunk_step * (n_chunks - 1), <, in_dim0);\n\n // Iterate over output (chunked) x/y coordinates.\n // In an inner loop, we will loop over z.\n const long max_idx = out_dim0 * out_dim1;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n long out_x = idx % out_dim0; // time\n long out_y = idx / out_dim0; // batch\n\n long chunk_idx = out_y % n_chunks;\n long in_y = out_y / n_chunks;\n\n long in_x = chunk_step * chunk_idx + out_x;\n\n if(in_x < in_dim0 && index[in_x * idx_stride0 + in_y * idx_stride1] > 0.1) {\n for(long z = 0; z < in_dim2; ++z)\n output[out_x * out_stride0 + out_y * out_stride1 + z * out_stride2] =\n input[in_x * in_stride0 + in_y * in_stride1 + z * in_stride2];\n oindex[out_x * oidx_stride0 + out_y * oidx_stride1] = 1;\n }\n else {\n for(long z = 0; z < in_dim2; ++z)\n output[out_x * out_stride0 + out_y * out_stride1 + z * out_stride2] = 0;\n oindex[out_x * oidx_stride0 + out_y * oidx_stride1] = 0;\n }\n }\n }\n '}[source]
c_fw_code = '\n assert_cmp(n_inputs, ==, 5);\n assert_cmp(n_outputs, ==, 2);\n Ndarray* input = inputs[0];\n Ndarray* index = inputs[1];\n Ndarray* chunk_params = inputs[4];\n Ndarray* output = *outputs[0];\n Ndarray* oindex = *outputs[1];\n\n assert_cmp(Ndarray_NDIM(input), ==, 3);\n assert_cmp(Ndarray_NDIM(index), ==, 2);\n assert_cmp(Ndarray_DIMS(input)[0], ==, Ndarray_DIMS(index)[0]);\n assert_cmp(Ndarray_DIMS(input)[1], ==, Ndarray_DIMS(index)[1]);\n assert_cmp(Ndarray_NDIM(chunk_params), ==, 1);\n assert_cmp(Ndarray_DIMS(chunk_params)[0], ==, 2);\n assert_cmp(Ndarray_NDIM(output), ==, 3);\n assert_cmp(Ndarray_NDIM(oindex), ==, 2);\n assert_cmp(Ndarray_DIMS(output)[0], ==, Ndarray_DIMS(oindex)[0]);\n assert_cmp(Ndarray_DIMS(output)[1], ==, Ndarray_DIMS(oindex)[1]);\n assert_cmp(Ndarray_DIMS(output)[2], ==, Ndarray_DIMS(input)[2]);\n\n start_dev_kernel(copy_kernel, (\n Ndarray_DEV_DATA(chunk_params),\n Ndarray_DEV_DATA(input),\n Ndarray_DIMS(input)[0],\n Ndarray_DIMS(input)[1],\n Ndarray_DIMS(input)[2],\n Ndarray_STRIDE(input, 0),\n Ndarray_STRIDE(input, 1),\n Ndarray_STRIDE(input, 2),\n Ndarray_DEV_DATA(index),\n Ndarray_STRIDE(index, 0),\n Ndarray_STRIDE(index, 1),\n Ndarray_DEV_DATA(output),\n Ndarray_DIMS(output)[0],\n Ndarray_DIMS(output)[1],\n Ndarray_STRIDE(output, 0),\n Ndarray_STRIDE(output, 1),\n Ndarray_STRIDE(output, 2),\n Ndarray_DEV_DATA(oindex),\n Ndarray_STRIDE(oindex, 0),\n Ndarray_STRIDE(oindex, 1)\n ));\n '[source]
code_version = ()[source]
static naive_chunk_start_frames(n_time, chunk_size, chunk_step)[source]

This is just for documentation / demonstration. Also used by testing code.
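
A plausible reference implementation of this helper, consistent with the n_chunks formula above (a sketch; the actual static method may differ in details):

  def naive_chunk_start_frames(n_time, chunk_size, chunk_step):
      # Start frame of every chunk; the last chunk may reach beyond n_time.
      frames = []
      t = 0
      while True:
          frames.append(t)
          if t + chunk_size >= n_time:
              break
          t += chunk_step
      return frames

  assert len(naive_chunk_start_frames(101, 50, 10)) == 7  # matches n_chunks above
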

classmethod custom_grad(op, inputs, output_grads)[source]
NativeOp.chunk(x, index, chunk_size, chunk_step)[source]
class NativeOp.UnChunking[source]

This reverses the output from Chunking, i.e. it un-chunks the time dimension. We get a 3d input (chunk_size, n_batch * n_chunks, n_dim) and return a 3d output (n_time, n_batch, n_dim), where the chunks are of size chunk_size, every chunk_step frames. Because of overlaps, we have to combine the overlapping chunks somehow. We do that uniformly, i.e. we take the mean over all overlapping chunks per frame.
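
For intuition, a numpy sketch of this uniform overlap handling (hypothetical helper; it ignores the index mask and only shows the mean over overlapping chunks):

  import numpy

  def unchunk_naive(chunked, n_time, n_batch, chunk_step):
      # chunked: (chunk_size, n_batch * n_chunks, n_dim); column b * n_chunks + c
      # holds chunk c of batch entry b (same layout as Chunking's copy_kernel).
      chunk_size, _, n_dim = chunked.shape
      n_chunks = chunked.shape[1] // n_batch
      out = numpy.zeros((n_time, n_batch, n_dim), dtype=chunked.dtype)
      counts = numpy.zeros((n_time, n_batch, 1), dtype=chunked.dtype)
      for c in range(n_chunks):
          t0 = c * chunk_step
          n = min(chunk_size, n_time - t0)
          out[t0:t0 + n] += chunked[:n, c::n_chunks]
          counts[t0:t0 + n] += 1
      return out / numpy.maximum(counts, 1)  # mean over overlaps per frame
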

in_info = ({'ndim': 3, 'shape': (None, None, None), 'name': 'input'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'index'}, {'ndim': 3, 'gradient': 'disconnected', 'shape': (None, None, None), 'name': 'output_buffer', 'want_inplace': 0}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'oindex_buffer', 'want_inplace': 1}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'ofactors_buffer', 'want_inplace': 2}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (2,), 'need_contiguous': True, 'name': 'chunk_params'})[source]
out_info = ({'ndim': 3, 'shape': ((2, 0), (2, 1), (2, 2)), 'name': 'output'}, {'ndim': 2, 'shape': ((3, 0), (3, 1)), 'name': 'oindex'}, {'ndim': 2, 'shape': ((4, 0), (4, 1)), 'name': 'ofactors'})[source]
c_extra_support_code = {'unchunk_kernel': '\n DEF_KERNEL\n void unchunk_kernel(\n float* chunk_params,\n float* input, long in_dim0, long in_dim1, long in_dim2, long in_stride0, long in_stride1, long in_stride2,\n float* index, long idx_stride0, long idx_stride1,\n float* output, long out_dim0, long out_dim1, long out_stride0, long out_stride1, long out_stride2,\n float* oindex, long oidx_stride0, long oidx_stride1,\n float* ofactors, long ofac_stride0, long ofac_stride1\n ) {\n assert_cmp(in_dim1 % out_dim1, ==, 0);\n const long n_chunks = in_dim1 / out_dim1;\n assert_cmp(n_chunks, >, 0);\n const long chunk_size = in_dim0;\n assert_cmp(long(chunk_params[0]), ==, chunk_size);\n const long chunk_step = long(chunk_params[1]);\n assert_cmp(chunk_step, >, 0);\n assert_cmp(chunk_step * (n_chunks - 1) + chunk_size, >=, out_dim0);\n assert_cmp(chunk_step * (n_chunks - 1), <, out_dim0);\n\n // Iterate over output (unchunked) x/y coordinates.\n // In an inner loop, we will loop over z.\n const long max_idx = out_dim0 * out_dim1;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n long out_x = idx % out_dim0; // time\n long out_y = idx / out_dim0; // batch\n\n float c = 0;\n for(long z = 0; z < in_dim2; ++z)\n output[out_x * out_stride0 + out_y * out_stride1 + z * out_stride2] = 0;\n\n // in_x = out_x - chunk_step * chunk_idx,\n // thus in_x < 0 when chunk_idx * chunk_step > out_x,\n // and in_x >= chunk_size when chunk_idx * chunk_step <= out_x - chunk_size,\n // thus we need chunk_idx <= out_x / chunk_step,\n // and chunk_idx > (out_x - chunk_size) / chunk_step.\n // Examples:\n // out_x=0, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,1\n // out_x=3, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,1\n // out_x=4, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,2\n // out_x=7, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,2\n // out_x=8, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,3\n // out_x=9, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=0,3\n // out_x=10, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=1,3\n // out_x=11, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=1,3\n // out_x=12, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=1,4\n // out_x=13, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=1,4\n // out_x=14, chunk_size=10, chunk_step=4 -> chunk_idx_start,end=2,4\n long chunk_idx_start = (out_x - chunk_size + chunk_step) / chunk_step;\n if(chunk_idx_start < 0) chunk_idx_start = 0;\n long chunk_idx_end = out_x / chunk_step + 1;\n if(chunk_idx_end > n_chunks) chunk_idx_end = n_chunks;\n assert_cmp(chunk_idx_start, <, chunk_idx_end);\n for(long chunk_idx = chunk_idx_start; chunk_idx < chunk_idx_end; ++chunk_idx) {\n long in_y = out_y * n_chunks + chunk_idx;\n long in_x = out_x - chunk_step * chunk_idx;\n assert_cmp(in_x, >=, 0);\n assert_cmp(in_x, <, chunk_size);\n if(index[in_x * idx_stride0 + in_y * idx_stride1] > 0.1) {\n c += 1;\n for(long z = 0; z < in_dim2; ++z)\n output[out_x * out_stride0 + out_y * out_stride1 + z * out_stride2] +=\n input[in_x * in_stride0 + in_y * in_stride1 + z * in_stride2];\n }\n }\n\n if(c > 0.1) {\n for(long z = 0; z < in_dim2; ++z)\n output[out_x * out_stride0 + out_y * out_stride1 + z * out_stride2] /= c;\n oindex[out_x * oidx_stride0 + out_y * oidx_stride1] = 1;\n ofactors[out_x * ofac_stride0 + out_y * ofac_stride1] = 1.0 / c;\n } else {\n oindex[out_x * oidx_stride0 + out_y * oidx_stride1] = 0;\n ofactors[out_x * ofac_stride0 + out_y * 
ofac_stride1] = 1.0;\n }\n }\n }\n '}[source]
c_fw_code = '\n assert_cmp(n_inputs, ==, 6);\n assert_cmp(n_outputs, ==, 3);\n Ndarray* input = inputs[0];\n Ndarray* index = inputs[1];\n Ndarray* chunk_params = inputs[5];\n Ndarray* output = *outputs[0];\n Ndarray* oindex = *outputs[1];\n Ndarray* ofactors = *outputs[2];\n\n assert_cmp(Ndarray_NDIM(input), ==, 3);\n assert_cmp(Ndarray_NDIM(index), ==, 2);\n assert_cmp(Ndarray_DIMS(input)[0], ==, Ndarray_DIMS(index)[0]);\n assert_cmp(Ndarray_DIMS(input)[1], ==, Ndarray_DIMS(index)[1]);\n assert_cmp(Ndarray_NDIM(chunk_params), ==, 1);\n assert_cmp(Ndarray_DIMS(chunk_params)[0], ==, 2);\n assert_cmp(Ndarray_NDIM(output), ==, 3);\n assert_cmp(Ndarray_NDIM(oindex), ==, 2);\n assert_cmp(Ndarray_NDIM(ofactors), ==, 2);\n assert_cmp(Ndarray_DIMS(output)[0], ==, Ndarray_DIMS(oindex)[0]);\n assert_cmp(Ndarray_DIMS(output)[1], ==, Ndarray_DIMS(oindex)[1]);\n assert_cmp(Ndarray_DIMS(output)[2], ==, Ndarray_DIMS(input)[2]);\n assert_cmp(Ndarray_DIMS(oindex)[0], ==, Ndarray_DIMS(ofactors)[0]);\n assert_cmp(Ndarray_DIMS(oindex)[1], ==, Ndarray_DIMS(ofactors)[1]);\n\n start_dev_kernel(unchunk_kernel, (\n Ndarray_DEV_DATA(chunk_params),\n Ndarray_DEV_DATA(input),\n Ndarray_DIMS(input)[0],\n Ndarray_DIMS(input)[1],\n Ndarray_DIMS(input)[2],\n Ndarray_STRIDE(input, 0),\n Ndarray_STRIDE(input, 1),\n Ndarray_STRIDE(input, 2),\n Ndarray_DEV_DATA(index),\n Ndarray_STRIDE(index, 0),\n Ndarray_STRIDE(index, 1),\n Ndarray_DEV_DATA(output),\n Ndarray_DIMS(output)[0],\n Ndarray_DIMS(output)[1],\n Ndarray_STRIDE(output, 0),\n Ndarray_STRIDE(output, 1),\n Ndarray_STRIDE(output, 2),\n Ndarray_DEV_DATA(oindex),\n Ndarray_STRIDE(oindex, 0),\n Ndarray_STRIDE(oindex, 1),\n Ndarray_DEV_DATA(ofactors),\n Ndarray_STRIDE(ofactors, 0),\n Ndarray_STRIDE(ofactors, 1)\n ));\n '[source]
code_version = ()[source]
classmethod custom_grad(op, inputs, output_grads)[source]
NativeOp.unchunk(x, index, chunk_size, chunk_step, n_time, n_batch)[source]
class NativeOp.SubtensorBatchedIndex[source]
Assume you have:
  idx: 2d (n_time, n_batch) -> index values in [0..n_dim-1]
  x: 3d (n_time, n_batch, n_dim)
Then, this op will calculate:
  x[..., idx[...]]: 2d (n_time, n_batch)
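
In numpy terms, the forward pass corresponds roughly to the following sketch (the kernel additionally clips out-of-range indices, which is reproduced here):

  import numpy

  def subtensor_batched_index_naive(x, idx):
      # x: (n_time, n_batch, n_dim), idx: (n_time, n_batch)
      n_time, n_batch, n_dim = x.shape
      j = numpy.clip(idx.astype("int64"), 0, n_dim - 1)
      t, b = numpy.indices((n_time, n_batch))
      return x[t, b, j]  # (n_time, n_batch)
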
in_info = ({'ndim': 3, 'shape': (None, None, None), 'name': 'x', 'bw_in_var': {'want_inplace': 0}}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'name': 'idx'})[source]
out_info = ({'ndim': 2, 'shape': ((0, 0), (0, 1)), 'name': 'y'},)[source]
classmethod grad_input_map(x, idx, y, DY)[source]
c_extra_support_code = {'select_kernel': '\n DEF_KERNEL\n void select_kernel(\n float* x, long x_dim0, long x_dim1, long x_dim2, long x_stride0, long x_stride1, long x_stride2,\n float* index, long idx_stride0, long idx_stride1,\n float* y, long y_stride0, long y_stride1\n ) {\n const long max_idx = x_dim0 * x_dim1;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n long d0 = idx % x_dim0;\n long d1 = idx / x_dim0;\n long d2 = long(index[d0 * idx_stride0 + d1 * idx_stride1]);\n if(d2 < 0) d2 = 0;\n if(d2 >= x_dim2) d2 = x_dim2 - 1;\n y[d0 * y_stride0 + d1 * y_stride1] = x[d0 * x_stride0 + d1 * x_stride1 + d2 * x_stride2];\n }\n }\n ', 'select_bw_kernel': '\n DEF_KERNEL\n void select_bw_kernel(\n float* Dx, long Dx_dim0, long Dx_dim1, long Dx_dim2, long Dx_stride0, long Dx_stride1, long Dx_stride2,\n float* index, long idx_stride0, long idx_stride1,\n float* Dy, long Dy_stride0, long Dy_stride1\n ) {\n const long max_idx = Dx_dim0 * Dx_dim1;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n long d0 = idx % Dx_dim0;\n long d1 = idx / Dx_dim0;\n long d2 = long(index[d0 * idx_stride0 + d1 * idx_stride1]);\n if(d2 < 0) d2 = 0;\n if(d2 >= Dx_dim2) d2 = Dx_dim2 - 1;\n Dx[d0 * Dx_stride0 + d1 * Dx_stride1 + d2 * Dx_stride2] = Dy[d0 * Dy_stride0 + d1 * Dy_stride1];\n }\n }\n '}[source]
c_fw_code = '\n assert_cmp(n_inputs, ==, 2);\n assert_cmp(n_outputs, ==, 1);\n Ndarray* x = inputs[0];\n Ndarray* idx = inputs[1];\n Ndarray* y = *outputs[0];\n\n assert_cmp(Ndarray_NDIM(x), ==, 3);\n assert_cmp(Ndarray_NDIM(idx), ==, 2);\n assert_cmp(Ndarray_DIMS(x)[0], ==, Ndarray_DIMS(idx)[0]);\n assert_cmp(Ndarray_DIMS(x)[1], ==, Ndarray_DIMS(idx)[1]);\n assert_cmp(Ndarray_NDIM(y), ==, 2);\n assert_cmp(Ndarray_DIMS(y)[0], ==, Ndarray_DIMS(idx)[0]);\n assert_cmp(Ndarray_DIMS(y)[1], ==, Ndarray_DIMS(idx)[1]);\n\n start_dev_kernel(select_kernel, (\n Ndarray_DEV_DATA(x),\n Ndarray_DIMS(x)[0],\n Ndarray_DIMS(x)[1],\n Ndarray_DIMS(x)[2],\n Ndarray_STRIDE(x, 0),\n Ndarray_STRIDE(x, 1),\n Ndarray_STRIDE(x, 2),\n Ndarray_DEV_DATA(idx),\n Ndarray_STRIDE(idx, 0),\n Ndarray_STRIDE(idx, 1),\n Ndarray_DEV_DATA(y),\n Ndarray_STRIDE(y, 0),\n Ndarray_STRIDE(y, 1)\n ));\n '[source]
c_bw_code = '\n assert_cmp(n_inputs, ==, 3);\n assert_cmp(n_outputs, ==, 1);\n Ndarray* x = inputs[0];\n Ndarray* idx = inputs[1];\n Ndarray* Dy = inputs[2];\n Ndarray* Dx = *outputs[0]; // inplace on x\n\n assert_cmp(Ndarray_NDIM(x), ==, 3);\n assert_cmp(Ndarray_NDIM(idx), ==, 2);\n assert_cmp(Ndarray_DIMS(x)[0], ==, Ndarray_DIMS(idx)[0]);\n assert_cmp(Ndarray_DIMS(x)[1], ==, Ndarray_DIMS(idx)[1]);\n assert_cmp(Ndarray_NDIM(Dy), ==, 2);\n assert_cmp(Ndarray_DIMS(Dy)[0], ==, Ndarray_DIMS(idx)[0]);\n assert_cmp(Ndarray_DIMS(Dy)[1], ==, Ndarray_DIMS(idx)[1]);\n assert_cmp(Ndarray_NDIM(Dx), ==, 3);\n assert_cmp(Ndarray_DIMS(Dx)[0], ==, Ndarray_DIMS(x)[0]);\n assert_cmp(Ndarray_DIMS(Dx)[1], ==, Ndarray_DIMS(x)[1]);\n assert_cmp(Ndarray_DIMS(Dx)[2], ==, Ndarray_DIMS(x)[2]);\n\n Ndarray_set_zero(Dx);\n start_dev_kernel(select_bw_kernel, (\n Ndarray_DEV_DATA(Dx),\n Ndarray_DIMS(Dx)[0],\n Ndarray_DIMS(Dx)[1],\n Ndarray_DIMS(Dx)[2],\n Ndarray_STRIDE(Dx, 0),\n Ndarray_STRIDE(Dx, 1),\n Ndarray_STRIDE(Dx, 2),\n Ndarray_DEV_DATA(idx),\n Ndarray_STRIDE(idx, 0),\n Ndarray_STRIDE(idx, 1),\n Ndarray_DEV_DATA(Dy),\n Ndarray_STRIDE(Dy, 0),\n Ndarray_STRIDE(Dy, 1)\n ));\n '[source]
NativeOp.subtensor_batched_index(x, idx)[source]
class NativeOp.SparseToDense[source]

Expects a sparse matrix in COOrdinate format, where W[s0[i,b],b,s1[i]] = weight[i,b] for all i, and all batches b. Will return W (time,batch,dim).
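
A numpy sketch of the scatter performed by the forward kernel (hypothetical helper; like assign_kernel below, it skips entries with mask < 0.1 or out-of-range indices, and the real op accumulates into the provided _initial_W buffer rather than into zeros):

  import numpy

  def sparse_to_dense_naive(s0, s1, weight, mask, n_time, n_dim):
      # s0, s1, weight, mask: 2d (n_sparse, n_batch)
      n_sparse, n_batch = s0.shape
      W = numpy.zeros((n_time, n_batch, n_dim), dtype="float32")
      for i in range(n_sparse):
          for b in range(n_batch):
              if mask[i, b] < 0.1:
                  continue
              t, j = int(s0[i, b]), int(s1[i, b])
              if 0 <= t < n_time and 0 <= j < n_dim:
                  W[t, b, j] += weight[i, b]
      return W
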

in_info = ({'ndim': 3, 'shape': (None, None, None), 'need_contiguous': True, 'name': '_initial_W', 'want_inplace': 0}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 's0'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 's1'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'weight'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'mask'})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'name': 'W'},)[source]
c_extra_support_code = {'assign_kernel': '\n DEF_KERNEL\n void assign_kernel(\n float* out, float* s0, float* s1, float* w, float* mask,\n long n_sparse_idx, long n_time, long n_batch, long n_dim)\n {\n long max_idx = n_batch * n_sparse_idx;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n if(mask[idx] < 0.1) continue;\n long batch = idx % n_batch;\n long t = (long) s0[idx];\n long j = (long) s1[idx];\n float y = w[idx];\n if(t < 0 || t >= n_time) continue; // error somehow?\n if(j < 0 || j >= n_dim) continue; // error somehow?\n long out_idx = t * n_batch * n_dim + batch * n_dim + j;\n out[out_idx] += y;\n }\n }\n '}[source]
c_fw_code = '\n assert(n_inputs == 5);\n assert(n_outputs == 1);\n Ndarray* s0 = inputs[1];\n Ndarray* s1 = inputs[2];\n Ndarray* weight = inputs[3];\n Ndarray* mask = inputs[4];\n Ndarray* out_W = *outputs[0];\n\n assert(Ndarray_NDIM(s0) == 2);\n assert(Ndarray_NDIM(s1) == 2);\n assert(Ndarray_NDIM(weight) == 2);\n assert(Ndarray_NDIM(mask) == 2);\n assert(Ndarray_NDIM(out_W) == 3);\n int n_sparse_idx = Ndarray_DIMS(s0)[0];\n assert(n_sparse_idx == Ndarray_DIMS(s1)[0]);\n assert(n_sparse_idx == Ndarray_DIMS(weight)[0]);\n assert(n_sparse_idx == Ndarray_DIMS(mask)[0]);\n int n_batch = Ndarray_DIMS(s0)[1];\n assert(n_batch == Ndarray_DIMS(s1)[1]);\n assert(n_batch == Ndarray_DIMS(weight)[1]);\n assert(n_batch == Ndarray_DIMS(mask)[1]);\n assert(n_batch == Ndarray_DIMS(out_W)[1]);\n int n_time = Ndarray_DIMS(out_W)[0];\n int n_dim = Ndarray_DIMS(out_W)[2];\n\n start_dev_kernel(assign_kernel, (\n Ndarray_DEV_DATA(out_W),\n Ndarray_DEV_DATA(s0),\n Ndarray_DEV_DATA(s1),\n Ndarray_DEV_DATA(weight),\n Ndarray_DEV_DATA(mask),\n n_sparse_idx, n_time, n_batch, n_dim\n ));\n '[source]
NativeOp.sparse_to_dense(s0, s1, weight, mask, n_time, n_dim)[source]
NativeOp.onehot_to_sparse(y, mask)[source]
NativeOp.sparse_slice_offset(s0, idx)[source]
Parameters:
  • s0 – 1D tensor, ordered indices for sparse coo-format matrix (without batch)
  • idx – scalar, index to find in s0
Returns: s0_idx, such that s0[i] >= idx for all i >= s0_idx, and s0[i] < idx for all i < s0_idx. This assumes that the indices in s0 are ordered.
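
In numpy, this corresponds to a left-side binary search (a sketch):

  import numpy

  def sparse_slice_offset_naive(s0, idx):
      # s0 sorted ascending; returns the first position whose value is >= idx
      return int(numpy.searchsorted(s0, idx, side="left"))
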

NativeOp.sparse_splice_offset_numpy(s0, idx)[source]

Like sparse_slice_offset().

class NativeOp.MaxAndArgmaxSparse[source]

Expects a sparse matrix in COOrdinate format, where W[s0[i,b],s1[i],b] = weight[i,b] for all i, and all batches b. It will return the max and argmax for all W[:,:,b] over the second axis.
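
A numpy sketch of the forward pass (hypothetical helper; like doit_kernel below, it updates the provided _out_max/_out_arg buffers in place and skips masked or out-of-range entries):

  import numpy

  def max_and_argmax_sparse_naive(s0, s1, weight, mask, out_max, out_arg):
      # s0, s1, weight, mask: 2d (n_in_time, n_batch); out_max/out_arg: 2d (n_out_time, n_batch)
      n_in_time, n_batch = s0.shape
      n_out_time = out_max.shape[0]
      for b in range(n_batch):
          for i in range(n_in_time):
              if mask[i, b] < 0.1:
                  continue
              t, j, w = int(s0[i, b]), int(s1[i, b]), weight[i, b]
              if 0 <= t < n_out_time and w > out_max[t, b]:
                  out_max[t, b] = w
                  out_arg[t, b] = float(j)
      return out_max, out_arg
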

in_info = ({'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 's0'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 's1'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'weight'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 'mask'}, {'ndim': 2, 'name': '_out_max', 'gradient': 'disconnected', 'need_contiguous': True, 'shape': (None, None), 'want_inplace': 0}, {'ndim': 2, 'name': '_out_arg', 'gradient': 'disconnected', 'need_contiguous': True, 'shape': (None, None), 'want_inplace': 1})[source]
out_info = ({'ndim': 2, 'shape': ((4, 0), (4, 1)), 'name': 'out_max'}, {'ndim': 2, 'shape': ((5, 0), (5, 1)), 'name': 'out_arg'})[source]
c_extra_support_code = {'doit_kernel': '\n DEF_KERNEL\n void doit_kernel(\n long n_batch, long n_in_time, long n_out_time,\n float* s0, float* s1, float* weight, float* mask,\n float* out_max, float* out_arg) {\n long batch_idx = threadIdx.x + blockDim.x * blockIdx.x;\n while(batch_idx < n_batch) {\n for(long i = 0; i < n_in_time; ++i) {\n long idx = i * n_batch + batch_idx;\n if(mask[idx] < 0.1) continue;\n long t = (long) s0[idx];\n long j = (long) s1[idx];\n float w = weight[idx];\n if(t < 0 || t >= n_out_time) continue; // error somehow?\n long out_idx = t * n_batch + batch_idx;\n if(w > out_max[out_idx]) {\n out_max[out_idx] = w;\n out_arg[out_idx] = (float) j;\n }\n }\n batch_idx += gridDim.x * blockDim.x;\n }\n }\n '}[source]
c_fw_code = '\n assert(n_inputs == 6);\n assert(n_outputs == 2);\n Ndarray* s0 = inputs[0];\n Ndarray* s1 = inputs[1];\n Ndarray* weight = inputs[2];\n Ndarray* mask = inputs[3];\n Ndarray* out_max = *outputs[0];\n Ndarray* out_arg = *outputs[1];\n\n assert(Ndarray_NDIM(s0) == 2);\n assert(Ndarray_NDIM(s1) == 2);\n assert(Ndarray_NDIM(weight) == 2);\n assert(Ndarray_NDIM(mask) == 2);\n assert(Ndarray_NDIM(out_max) == 2);\n assert(Ndarray_NDIM(out_arg) == 2);\n int n_in_time = Ndarray_DIMS(s0)[0];\n assert(n_in_time == Ndarray_DIMS(s1)[0]);\n assert(n_in_time == Ndarray_DIMS(weight)[0]);\n assert(n_in_time == Ndarray_DIMS(mask)[0]);\n int n_batch = Ndarray_DIMS(s0)[1];\n assert(n_batch == Ndarray_DIMS(s1)[1]);\n assert(n_batch == Ndarray_DIMS(weight)[1]);\n assert(n_batch == Ndarray_DIMS(mask)[1]);\n assert(n_batch == Ndarray_DIMS(out_arg)[1]);\n assert(n_batch == Ndarray_DIMS(out_max)[1]);\n int n_out_time = Ndarray_DIMS(out_arg)[0];\n assert(n_out_time == Ndarray_DIMS(out_max)[0]);\n assert(out_max != out_arg); // earlier bug in NativeOp\n\n start_dev_kernel(doit_kernel, (\n n_batch, n_in_time, n_out_time,\n Ndarray_DEV_DATA(s0),\n Ndarray_DEV_DATA(s1),\n Ndarray_DEV_DATA(weight),\n Ndarray_DEV_DATA(mask),\n Ndarray_DEV_DATA(out_max),\n Ndarray_DEV_DATA(out_arg)\n ));\n '[source]
code_version = ()[source]
NativeOp.max_and_argmax_sparse(s0, s1, weight, mask, out_max, out_arg)[source]
class NativeOp.CrossEntropySoftmaxAndGradientZSparse[source]

y_target is given in sparse COOrdinate format. We will calculate CE[t,b] = -sum_i y_target[t,b,i] * log(softmax(z[t,b])[i]), for any time frame t and batch b, and grad(CE[t,b], z[t,b]) = softmax(z[t,b]) - y_target[t,b]. We also support an index mask for z, i.e. a mask over the valid [t,b] positions.
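
For reference, the dense computation for a single [t,b] as a numpy sketch (ignoring the masks and the sparse target format):

  import numpy

  def ce_and_grad_z_dense(z_tb, y_target_tb):
      # z_tb, y_target_tb: 1d (n_dim,) for a single [t, b]
      e = numpy.exp(z_tb - z_tb.max())         # max-shift for numerical stability
      sm = e / e.sum()
      ce = -numpy.sum(y_target_tb * numpy.log(numpy.maximum(sm, 1e-30)))
      grad_z = sm - y_target_tb
      return ce, grad_z
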

in_info = ({'ndim': 3, 'shape': (None, None, None), 'need_contiguous': True, 'name': 'z'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'z_mask'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'y_target_t'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'y_target_i'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'y_target_w'}, {'ndim': 2, 'shape': (None, None), 'need_contiguous': True, 'name': 'y_target_mask'})[source]
out_info = ({'ndim': 2, 'shape': ((0, 0), (0, 1)), 'name': 'out_ce'}, {'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'name': 'out_grad_z'}, {'ndim': 2, 'shape': ((0, 0), (0, 1)), 'name': '_out_max_z'})[source]
c_extra_support_code = {'softmax_kernel': '\n DEF_KERNEL\n void softmax_kernel(\n float* out_softmax,\n float* z, float* max_z, float* mask,\n long stride, long max_idx)\n {\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n long start = idx * stride;\n float s = 0;\n for(long i = 0; i < stride; ++i) {\n s += exp(z[start + i] - max_z[idx]);\n }\n if(s < 1e-16) s = 1e-16;\n for(long i = 0; i < stride; ++i) {\n float y = exp(z[start + i] - max_z[idx]) / s;\n out_softmax[start + i] = (mask[idx] > 0.5) ? y : 0;\n }\n }\n }\n ', 'ce_sm_grad_kernel': '\n DEF_KERNEL\n void ce_sm_grad_kernel(\n float* out_ce, float* out_grad_z,\n float* z, float* max_z, float* z_mask,\n float* s0, float* s1, float* w, float* s_mask,\n long n_time, long n_batch, long n_dim, long n_sparse_index)\n {\n long max_idx = n_batch * n_sparse_index;\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n if(s_mask[idx] < 0.1) continue;\n long batch = idx % n_batch;\n long t = (long) s0[idx];\n long j = (long) s1[idx];\n float y_target = w[idx];\n if(t < 0 || t >= n_time) continue; // error somehow?\n if(j < 0 || j >= n_dim) continue; // error somehow?\n long out_ce_idx = t * n_batch + batch;\n long out_y_idx = t * n_batch * n_dim + batch * n_dim + j;\n // This assumes that out_grad_z is still softmax(z).\n // This also assumes that every [t,j] is only represented once in the sparse data.\n out_ce[out_ce_idx] -= y_target * log(fmax(out_grad_z[out_y_idx], 1e-30f));\n out_grad_z[out_y_idx] -= y_target;\n }\n }\n ', 'max_kernel': '\n DEF_KERNEL\n void max_kernel(float* out, float* v, float* mask, long stride, long max_idx) {\n for(\n long idx = threadIdx.x + blockDim.x * blockIdx.x;\n idx < max_idx;\n idx += gridDim.x * blockDim.x)\n {\n if(mask[idx] < 0.1)\n continue;\n long start = idx * stride;\n float last_max = v[start];\n out[idx] = last_max;\n for(long i = 1; i < stride; ++i) {\n float cur = v[start + i];\n if(cur > last_max) {\n last_max = cur;\n out[idx] = cur;\n }\n }\n }\n }\n '}[source]
c_fw_code = '\n assert(n_inputs == 6);\n assert(n_outputs == 3);\n Ndarray* z = inputs[0];\n Ndarray* z_mask = inputs[1];\n Ndarray* s0 = inputs[2];\n Ndarray* s1 = inputs[3];\n Ndarray* w = inputs[4];\n Ndarray* s_mask = inputs[5];\n Ndarray* out_ce = *outputs[0];\n Ndarray* out_grad_z = *outputs[1];\n Ndarray* out_max_z = *outputs[2];\n\n assert(Ndarray_NDIM(z) == 3);\n assert(Ndarray_NDIM(z_mask) == 2);\n assert(Ndarray_NDIM(out_ce) == 2);\n assert(Ndarray_NDIM(out_grad_z) == 3);\n assert(Ndarray_NDIM(out_max_z) == 2);\n assert(Ndarray_NDIM(s0) == 2);\n assert(Ndarray_NDIM(s1) == 2);\n assert(Ndarray_NDIM(w) == 2);\n assert(Ndarray_NDIM(out_ce) == 2);\n int n_time = Ndarray_DIMS(z)[0];\n int n_batch = Ndarray_DIMS(z)[1];\n int n_dim = Ndarray_DIMS(z)[2];\n assert(n_time == Ndarray_DIMS(z_mask)[0]);\n assert(n_time == Ndarray_DIMS(out_ce)[0]);\n assert(n_time == Ndarray_DIMS(out_grad_z)[0]);\n assert(n_time == Ndarray_DIMS(out_max_z)[0]);\n assert(n_batch == Ndarray_DIMS(z_mask)[1]);\n assert(n_batch == Ndarray_DIMS(out_ce)[1]);\n assert(n_batch == Ndarray_DIMS(out_grad_z)[1]);\n assert(n_batch == Ndarray_DIMS(out_max_z)[1]);\n assert(n_batch == Ndarray_DIMS(s0)[1]);\n assert(n_batch == Ndarray_DIMS(s1)[1]);\n assert(n_batch == Ndarray_DIMS(w)[1]);\n assert(n_batch == Ndarray_DIMS(s_mask)[1]);\n assert(n_dim == Ndarray_DIMS(out_grad_z)[2]);\n int n_sparse_index = Ndarray_DIMS(s0)[0];\n assert(n_sparse_index == Ndarray_DIMS(s1)[0]);\n assert(n_sparse_index == Ndarray_DIMS(w)[0]);\n assert(n_sparse_index == Ndarray_DIMS(s_mask)[0]);\n\n start_dev_kernel(max_kernel, (\n Ndarray_DEV_DATA(out_max_z), Ndarray_DEV_DATA(z), Ndarray_DEV_DATA(z_mask),\n n_dim, n_time * n_batch\n ));\n Ndarray_set_zero(out_ce);\n start_dev_kernel(softmax_kernel, (\n Ndarray_DEV_DATA(out_grad_z),\n Ndarray_DEV_DATA(z), Ndarray_DEV_DATA(out_max_z), Ndarray_DEV_DATA(z_mask),\n n_dim, n_time * n_batch\n ));\n start_dev_kernel(ce_sm_grad_kernel, (\n Ndarray_DEV_DATA(out_ce), Ndarray_DEV_DATA(out_grad_z),\n Ndarray_DEV_DATA(z), Ndarray_DEV_DATA(out_max_z), Ndarray_DEV_DATA(z_mask),\n Ndarray_DEV_DATA(s0), Ndarray_DEV_DATA(s1), Ndarray_DEV_DATA(w), Ndarray_DEV_DATA(s_mask),\n n_time, n_batch, n_dim, n_sparse_index\n ));\n '[source]
NativeOp.crossentropy_softmax_and_gradient_z_sparse(z, z_mask, y_target_t, y_target_i, y_target_w, y_target_mask)[source]
NativeOp.crossentropy_softmax_and_gradient_z_sparse__slow(z, z_mask, y_target_t, y_target_i, y_target_w, y_target_mask)[source]
class NativeOp.FastBaumWelchOp[source]
inputs:
  param am_scores: scores in -log space. 3d (time,batch,dim)
  param edges: edges of the graph (from, to, emission_idx, sequence_idx)
  param weights: weights of the edges
outputs:
  param output: Baum-Welch alignment, scores in -log space. 3d (time,batch,dim), like am_scores
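
A sketch of how the edge inputs might be laid out for a trivial single-sequence FSA (hypothetical values; the rows-as-fields layout is assumed here by analogy with start_end_states, whose in_info shape is (2, None) – verify against the actual op before relying on it):

  import numpy

  # One edge: state 0 --emission 0--> state 1, belonging to sequence 0.
  edges = numpy.array([[0],   # from
                       [1],   # to
                       [0],   # emission_idx
                       [0]],  # sequence_idx
                      dtype="int32")
  weights = numpy.zeros((1,), dtype="float32")               # edge weights, -log space
  start_end_states = numpy.array([[0], [1]], dtype="int32")  # assumed: row 0 start, row 1 end
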
in_info = ({'ndim': 3, 'gradient': 'disconnected', 'shape': (None, None, None), 'need_contiguous': True, 'name': 'am_scores'}, {'ndim': 2, 'name': 'edges', 'gradient': 'disconnected', 'dtype': 'int32', 'need_contiguous': True, 'shape': (None, None)}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (None,), 'need_contiguous': True, 'name': 'weights'}, {'ndim': 2, 'name': 'start_end_states', 'gradient': 'disconnected', 'dtype': 'int32', 'need_contiguous': True, 'shape': (2, None)}, {'ndim': 2, 'gradient': 'disconnected', 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'index'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (2, None), 'need_contiguous': True, 'name': 'state_buffer'})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'need_contiguous': True, 'name': 'output'}, {'ndim': 2, 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'sums'})[source]
c_extra_support_code = {'100_init_bwd_state_buffer': '\n __global__\n void init_bwd_state_buffer(float* states, unsigned* end_states, unsigned t, unsigned max_t, float* index, unsigned index_stride) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (index[t * index_stride + idx] == 1.0 && (t == max_t || index[(t + 1) * index_stride + idx] == 0.0)) {\n unsigned state_idx = end_states[idx];\n states[state_idx] = 0.0;\n }\n }\n ', '110_write_alignment_to_file': '\n void write_alignment_to_file(float* d_state_buffer, float* d_index, unsigned index_stride,\n unsigned* d_start_states, unsigned* d_end_states,\n float pruning, unsigned n_frames, unsigned n_seqs, unsigned n_states, unsigned batch_idx) {\n std::vector<float> state_buffer((n_frames + 1u) * n_states);\n std::vector<float> index (n_frames * index_stride);\n std::vector<unsigned> start_states(n_seqs);\n std::vector<unsigned> end_states (n_seqs);\n\n HANDLE_ERROR(cudaMemcpy(state_buffer.data(), d_state_buffer, state_buffer.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(index.data(), d_index, index.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(start_states.data(), d_start_states, start_states.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(end_states.data(), d_end_states, end_states.size() * sizeof(float), cudaMemcpyDeviceToHost));\n\n for (unsigned seq = 0u; seq < n_seqs; seq++) {\n std::stringstream filename;\n filename << "alignment.dump." << batch_idx << \'.\' << seq;\n std::ofstream out(filename.str().c_str(), std::ios::out | std::ios::trunc);\n for (unsigned t = 0u; t <= n_frames; t++) {\n if (t > 0u and index[seq * index_stride + t] <= 0.0) {\n break;\n }\n float sum = std::numeric_limits<float>::infinity();\n for (unsigned s = start_states[seq]; s <= end_states[seq]; s++) {\n const float val = state_buffer[t * n_states + s];\n float diff = val - sum;\n if (!isnan(diff)) {\n sum = -log1p(exp(-abs(diff))) + min(sum, val);\n }\n }\n for (unsigned s = start_states[seq]; s <= end_states[seq]; s++) {\n const float val = state_buffer[t * n_states + s] - sum;\n if (val <= pruning) {\n out << t << \' \' << (s - start_states[seq]) << \' \' << val << \'\\n\';\n }\n }\n }\n }\n }\n ', '001_set_start_states': '\n __global__\n void set_start_states(float* states, unsigned* start_states) {\n unsigned state_idx = start_states[blockIdx.x * blockDim.x + threadIdx.x];\n states[state_idx] = 0.0;\n }\n ', '010_fill_array': '\n __global__\n void fill_array(float* array, float value, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = value;\n }\n }\n ', '111_write_output_to_file': '\n void write_output_to_file(float* d_out, float* d_index, unsigned index_stride,\n float pruning, unsigned n_frames, unsigned n_seqs, unsigned n_emissions, unsigned batch_idx) {\n std::vector<float> buffer(n_frames * n_seqs * n_emissions);\n std::vector<float> index (n_frames * index_stride);\n\n HANDLE_ERROR(cudaMemcpy(buffer.data(), d_out, buffer.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(index.data(), d_index, index.size() * sizeof(float), cudaMemcpyDeviceToHost));\n\n for (unsigned seq = 0u; seq < n_seqs; seq++) {\n std::stringstream filename;\n filename << "target.dump." 
<< batch_idx << \'.\' << seq;\n std::ofstream out(filename.str().c_str(), std::ios::out | std::ios::trunc);\n for (unsigned t = 0u; t <= n_frames; t++) {\n if (t > 0u and index[seq * index_stride + t] <= 0.0) {\n break;\n }\n for (unsigned e = 0u; e < n_emissions; e++) {\n const float val = buffer[t * n_seqs * n_emissions + seq * n_emissions + e];\n if (val <= pruning) {\n out << t << \' \' << e << \' \' << val << \'\\n\';\n }\n }\n }\n }\n }\n ', '101_next_frame': '\n __global__\n void next_frame(bool fwd, unsigned num_edges, unsigned num_emissions,\n unsigned* sequence_idxs, unsigned* from_buffer, unsigned* to_buffer, float* weight_buffer, unsigned* emission_idxs,\n float* prev_frame, float* next_frame, float* am_scores, float* edge_buffer) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_edges) {\n return;\n }\n\n unsigned from = from_buffer [idx];\n float prev_val = prev_frame[from];\n if (isinf(prev_val)) {\n edge_buffer[idx] = CUDART_INF_F;\n return;\n }\n\n unsigned to = to_buffer [idx];\n unsigned emission_idx = emission_idxs[idx];\n float edge_weight = weight_buffer[idx];\n unsigned sequence_idx = sequence_idxs[idx];\n\n float val = prev_val + edge_weight + am_scores[sequence_idx * num_emissions + emission_idx];\n\n if (fwd) {\n edge_buffer[idx] += val;\n }\n else {\n edge_buffer[idx] += prev_val;\n }\n atomic_prob_add(next_frame + to, val);\n }\n ', '013_atomic_prob_add': '\n __device__\n void atomic_prob_add(float* a, float b) {\n int* addr = (int*)a;\n int old = __float_as_int(*a);\n int assumed;\n do {\n assumed = old;\n old = atomicCAS(addr, assumed, __float_as_int(prob_add(__int_as_float(old), b)));\n } while (old != assumed);\n }\n ', '020_dump_to_file': "\n template<typename T>\n void dump_to_file_1d(T* d_mem, unsigned n_d1, std::string const& path) {\n std::vector<T> buffer(n_d1);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n T val = buffer[i1];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << val << '\\n';\n }\n }\n }\n\n template<typename T>\n void dump_to_file_2d(T* d_mem, unsigned n_d1, unsigned n_d2, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n T val = buffer[i1 * n_d2 + i2];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << val << '\\n';\n }\n }\n }\n }\n\n template<typename T>\n void dump_to_file_3d(T* d_mem, unsigned n_d1, unsigned n_d2, unsigned n_d3, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2 * n_d3);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n for (size_t i3 = 0ul; i3 < n_d3; i3++) {\n T val = buffer[i1 * n_d2 * n_d3 + i2 * n_d3 + i3];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << i3 << ' ' << val << '\\n';\n }\n }\n }\n }\n }\n ", '012_prob_add': '\n __device__\n float prob_add(float a, float b) {\n float diff = a - b;\n if (isnan(diff)) {\n 
return CUDART_INF_F;\n }\n else {\n return -log1p(exp(-abs(diff))) + min(a, b);\n }\n }\n ', '102_normalize': '\n __global__\n void normalize(float* buffer, unsigned* sequence_idxs, unsigned num_edges, unsigned num_seqs, float* sum_output) {\n extern __shared__ float sum[];\n\n buffer += blockIdx.x * num_edges;\n\n for (unsigned s = 0u; s < num_seqs; s++) {\n sum[s] = CUDART_INF_F;\n }\n\n for (unsigned e = 0u; e < num_edges; e++) {\n unsigned s = sequence_idxs[e];\n sum[s] = prob_add(sum[s], buffer[e]);\n }\n\n for (unsigned s = 0ul; s < num_seqs; s++) {\n if (isinf(sum[s])) {\n // if the frame is empty (happens due to batching of seqs with unequal length), set it to 0\n sum_output[blockIdx.x * num_seqs + s] = 0.0;\n }\n else {\n sum_output[blockIdx.x * num_seqs + s] = sum[s];\n }\n }\n\n for (unsigned e = 0u; e < num_edges; e++) {\n unsigned s = sequence_idxs[e];\n buffer[e] -= sum[s];\n }\n }\n ', '103_compute_result': '\n __global__\n void compute_result(float* edge_buffer, float* out, unsigned* emission_idxs, unsigned* sequence_idxs,\n unsigned frame_stride, unsigned seq_stride,\n unsigned num_frames, unsigned num_seqs, unsigned num_edges) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_frames * num_edges) {\n return;\n }\n\n unsigned e_idx = idx % num_edges;\n unsigned frame = idx / num_edges;\n unsigned emission_idx = emission_idxs[e_idx];\n unsigned seq_idx = sequence_idxs[e_idx];\n float score = edge_buffer[idx];\n\n atomic_prob_add(out + frame * frame_stride + seq_idx * seq_stride + emission_idx, score);\n }\n ', '011_remove_inf': '\n __global__\n void remove_inf(float* array, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = fminf(array[idx], 1e32);\n }\n }\n '}[source]
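All scores in these kernels live in -log space, so prob_add above adds probabilities via the stable identity -log(exp(-a) + exp(-b)) = min(a, b) - log1p(exp(-|a - b|)); the isnan check catches the inf-minus-inf case where both inputs represent probability zero. A minimal NumPy check of that identity (illustration only, not part of the module):

    import numpy as np

    def prob_add(a, b):
        # mirrors the CUDA prob_add above: -log(exp(-a) + exp(-b)), computed stably
        diff = a - b
        if np.isnan(diff):              # inf - inf: both inputs are -log(0)
            return np.inf
        return -np.log1p(np.exp(-abs(diff))) + min(a, b)

    a, b = 2.3, 4.1                     # arbitrary scores in -log space
    assert np.isclose(prob_add(a, b), -np.logaddexp(-a, -b))
    assert prob_add(np.inf, np.inf) == np.inf   # adding two zero-probabilities stays zero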
c_fw_code = '\n // am_scores, edges, weights, start_end_states, index, state_buffer* = input_names (*: inplace)\n // output = output_names\n assert(n_inputs == 6);\n assert(n_outputs == 2);\n Ndarray* am_scores = inputs[0];\n Ndarray* edges = inputs[1];\n Ndarray* weights = inputs[2];\n Ndarray* start_end_states = inputs[3];\n Ndarray* index = inputs[4];\n Ndarray* state_buffer = inputs[5];\n Ndarray* out = *outputs[0];\n Ndarray* sum_output = *outputs[1];\n\n /*\n debug_print(context, am_scores, "am_scores");\n debug_print(context, edges, "edges");\n debug_print(context, weights, "weights");\n debug_print(context, start_end_states, "start_end_states");\n debug_print(context, index, "index");\n debug_print(context, state_buffer, "state_buffer");\n */\n\n assert(Ndarray_DIMS(am_scores)[0] == Ndarray_DIMS(out)[0]);\n assert(Ndarray_DIMS(am_scores)[1] == Ndarray_DIMS(out)[1]);\n assert(Ndarray_DIMS(am_scores)[2] == Ndarray_DIMS(out)[2]);\n assert(Ndarray_DIMS(am_scores)[1] == Ndarray_DIMS(start_end_states)[1]);\n\n assert(Ndarray_DIMS(sum_output)[0] == Ndarray_DIMS(am_scores)[0]);\n assert(Ndarray_DIMS(sum_output)[1] == Ndarray_DIMS(am_scores)[1]);\n\n bool dump_alignment = false;\n bool dump_output = false;\n unsigned dump_every = 40u;\n static unsigned batch_idx = 0u;\n float pruning = 10.f;\n\n unsigned* d_from = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 0 * Ndarray_STRIDE(edges, 0));\n unsigned* d_to = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 1 * Ndarray_STRIDE(edges, 0));\n unsigned* d_emission_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 2 * Ndarray_STRIDE(edges, 0));\n unsigned* d_sequence_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 3 * Ndarray_STRIDE(edges, 0));\n float* d_weights = Ndarray_DEV_DATA(weights);\n float* d_am_scores = Ndarray_DEV_DATA(am_scores);\n unsigned* d_start_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(start_end_states) + 0 * Ndarray_STRIDE(start_end_states, 0));\n unsigned* d_end_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(start_end_states) + 1 * Ndarray_STRIDE(start_end_states, 0));\n float* d_index = Ndarray_DEV_DATA(index);\n float* d_state_buffer_prev = Ndarray_DEV_DATA(state_buffer) + 0 * Ndarray_STRIDE(state_buffer, 0);\n float* d_state_buffer_next = Ndarray_DEV_DATA(state_buffer) + 1 * Ndarray_STRIDE(state_buffer, 0);\n float* d_out = Ndarray_DEV_DATA(out);\n float* d_sum_output = Ndarray_DEV_DATA(sum_output);\n\n unsigned n_frames = Ndarray_DIMS(am_scores)[0];\n unsigned n_seqs = Ndarray_DIMS(am_scores)[1];\n unsigned n_emissions = Ndarray_DIMS(am_scores)[2];\n unsigned n_states = Ndarray_DIMS(state_buffer)[1];\n unsigned n_edges = Ndarray_DIMS(edges)[1];\n unsigned n_threads = 1024u;\n unsigned n_blocks = (n_edges + n_threads - 1) / n_threads;\n\n unsigned frame_stride = Ndarray_STRIDE(am_scores, 0);\n unsigned sequence_stride = Ndarray_STRIDE(am_scores, 1);\n unsigned index_stride = Ndarray_STRIDE(index, 0);\n\n assert(n_frames > 0);\n\n //std::cerr << "n_frames: " << n_frames << std::endl;\n //std::cerr << "n_seqs: " << n_seqs << std::endl;\n //std::cerr << "n_emissions: " << n_emissions << std::endl;\n //std::cerr << "n_states: " << n_states << std::endl;\n //std::cerr << "n_edges: " << n_edges << std::endl;\n //std::cerr << "n_threads: " << n_threads << std::endl;\n //std::cerr << "n_blocks: " << n_blocks << std::endl;\n\n //std::cerr << "frame_stride: " << frame_stride << std::endl;\n //std::cerr << "sequnence_stride: " << 
sequence_stride << std::endl;\n //std::cerr << "index_stride: " << index_stride << std::endl;\n\n // initialize edge buffer\n float* d_edge_buffer = reinterpret_cast<float*>(device_malloc(n_edges * n_frames * sizeof(float)));\n unsigned n_fill_blocks = (n_edges * n_frames + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_edge_buffer, 0.0, n_edges * n_frames);\n HANDLE_LAST_ERROR();\n\n // initialize the state buffer\n n_fill_blocks = (n_states + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_prev, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n set_start_states<<<1, n_seqs>>>(d_state_buffer_prev, d_start_states);\n\n // initialize full state buffer (only used to dump the alignment)\n float* d_state_buffer_all = NULL;\n if (dump_alignment and batch_idx %% dump_every == 0) {\n d_state_buffer_all = reinterpret_cast<float*>(device_malloc(n_states * (n_frames + 1u) * sizeof(float)));\n cudaMemcpy(d_state_buffer_all, d_state_buffer_prev, n_states * sizeof(float), cudaMemcpyDeviceToDevice);\n }\n\n // fwd pass\n for (unsigned t = 0u; t < n_frames; t++) {\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_next, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n next_frame<<<n_blocks, n_threads>>>(true, n_edges, sequence_stride,\n d_sequence_idxs, d_from, d_to, d_weights, d_emission_idxs,\n d_state_buffer_prev, d_state_buffer_next, d_am_scores + t * frame_stride, d_edge_buffer + t * n_edges);\n HANDLE_LAST_ERROR();\n if (dump_alignment and batch_idx %% dump_every == 0) {\n cudaMemcpy(d_state_buffer_all + (t + 1u) * n_states, d_state_buffer_next, n_states * sizeof(float), cudaMemcpyDeviceToDevice);\n }\n std::swap(d_state_buffer_prev, d_state_buffer_next);\n }\n\n // bwd pass\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_prev, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n for (unsigned t = n_frames; t > 0; t--) {\n init_bwd_state_buffer<<<1, n_seqs>>>(d_state_buffer_prev, d_end_states, t - 1, n_frames - 1, d_index, index_stride);\n HANDLE_LAST_ERROR();\n if (dump_alignment and batch_idx %% dump_every == 0) {\n float alpha = 1.0f;\n HANDLE_ERROR(cublasSaxpy(handle, n_states, &alpha, d_state_buffer_prev, 1, d_state_buffer_all + t * n_states, 1));\n }\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_next, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n next_frame<<<n_blocks, n_threads>>>(false, n_edges, sequence_stride,\n d_sequence_idxs, d_to, d_from, d_weights, d_emission_idxs,\n d_state_buffer_prev, d_state_buffer_next, d_am_scores + (t - 1) * frame_stride, d_edge_buffer + (t - 1) * n_edges);\n HANDLE_LAST_ERROR();\n std::swap(d_state_buffer_prev, d_state_buffer_next);\n }\n if (dump_alignment and batch_idx %% dump_every == 0) {\n float alpha = 1.0f;\n HANDLE_ERROR(cublasSaxpy(handle, n_states, &alpha, d_state_buffer_prev, 1, d_state_buffer_all, 1));\n }\n\n // normalize at each time frame\n normalize<<<n_frames, 1, n_seqs * sizeof(float)>>>(d_edge_buffer, d_sequence_idxs, n_edges, n_seqs, d_sum_output);\n HANDLE_LAST_ERROR();\n\n // dump alignment\n if (dump_alignment and batch_idx %% dump_every == 0) {\n write_alignment_to_file(d_state_buffer_all, d_index, index_stride, d_start_states, d_end_states,\n pruning, n_frames, n_seqs, n_states, batch_idx);\n }\n\n n_fill_blocks = (n_frames * n_seqs * n_emissions + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_out, 
std::numeric_limits<float>::infinity(), n_frames * n_seqs * n_emissions);\n HANDLE_LAST_ERROR();\n\n frame_stride = Ndarray_STRIDE(out, 0);\n sequence_stride = Ndarray_STRIDE(out, 1);\n n_blocks = (n_frames * n_edges + n_threads - 1u) / n_threads;\n compute_result<<<n_blocks, n_threads>>>(d_edge_buffer, d_out, d_emission_idxs, d_sequence_idxs,\n frame_stride, sequence_stride, n_frames, n_seqs, n_edges);\n HANDLE_LAST_ERROR();\n\n #if TENSORFLOW\n // Certain TensorFlow code doesn\'t like inf, even if it is just the CheckNumerics,\n // which is helpful for debugging.\n // We replace it by a very high number, so that tf.exp(-out) will still result in 0.0.\n n_blocks = (n_frames * n_seqs * n_emissions + n_threads - 1u) / n_threads;\n remove_inf<<<n_blocks, n_threads>>>(d_out, n_frames * n_seqs * n_emissions);\n //debug_print(context, out, "out");\n #endif\n if (dump_output and batch_idx %% dump_every == 0) {\n write_output_to_file(d_out, d_index, index_stride, pruning, n_frames, n_seqs, n_emissions, batch_idx);\n }\n\n device_free(d_edge_buffer);\n if (d_state_buffer_all != NULL) {\n device_free(d_state_buffer_all);\n }\n batch_idx++;\n '[source]
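For orientation, the forward code above works entirely in -log space: it initializes the state buffer to +inf and puts 0 at the start states, runs the forward recursion frame by frame (next_frame with fwd=true, accumulating per-edge scores), runs the backward recursion with from/to swapped (fwd=false, adding the backward state scores onto the same per-edge buffer), normalizes the per-edge scores within each frame (normalize), and finally scatters them onto the emissions (compute_result). The NumPy sketch below re-expresses that recursion for a single, fully valid sequence; it omits the index mask, the state_buffer input, the batch dimension and all pruning/dumping, and the name fast_baum_welch_single_seq is made up for this illustration, it is not the library's API:

    import numpy as np

    def prob_add(a, b):
        # -log(exp(-a) + exp(-b)); np.logaddexp handles -inf (i.e. +inf scores) cleanly
        return -np.logaddexp(-a, -b)

    def fast_baum_welch_single_seq(am_scores, edges, weights, start_state, end_state):
        # am_scores: (T, n_emissions) in -log space (the op itself takes (time, batch, dim))
        # edges: int array (4, n_edges), rows = (from, to, emission_idx, sequence_idx)
        # weights: (n_edges,) edge weights in -log space
        T, n_emissions = am_scores.shape
        n_edges = edges.shape[1]
        n_states = int(max(edges[0].max(), edges[1].max())) + 1

        # forward: alpha[t, s] = -log forward score of state s after t frames
        alpha = np.full((T + 1, n_states), np.inf)
        alpha[0, start_state] = 0.0
        edge_buffer = np.full((T, n_edges), np.inf)
        for t in range(T):
            for e in range(n_edges):
                frm, to, em, _seq = edges[:, e]
                val = alpha[t, frm] + weights[e] + am_scores[t, em]
                edge_buffer[t, e] = val                         # forward part of the edge score
                alpha[t + 1, to] = prob_add(alpha[t + 1, to], val)

        # backward: beta[t, s] = -log backward score from state s at frame t to the end
        beta = np.full((T + 1, n_states), np.inf)
        beta[T, end_state] = 0.0
        for t in reversed(range(T)):
            for e in range(n_edges):
                frm, to, em, _seq = edges[:, e]
                edge_buffer[t, e] += beta[t + 1, to]            # now forward + backward per edge
                beta[t, frm] = prob_add(beta[t, frm],
                                        beta[t + 1, to] + weights[e] + am_scores[t, em])

        # per-frame normalization (the normalize kernel; empty-frame handling omitted),
        # then scatter onto the emissions (the compute_result kernel)
        sums = -np.logaddexp.reduce(-edge_buffer, axis=1)       # (T,)
        out = np.full((T, n_emissions), np.inf)
        for t in range(T):
            for e in range(n_edges):
                em = edges[2, e]
                out[t, em] = prob_add(out[t, em], edge_buffer[t, e] - sums[t])
        return out, sums                                        # in -log space, like the op's outputs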
c_bw_code = None[source]
cpu_support = False[source]
class NativeOp.MultiEndFastBaumWelchOp[source]
inputs:
  param am_scores: scores in -log space, 3d (time,batch,dim)
  param edges: edges of the graph (from, to, emission_idx, sequence_idx)
  param weights: weights of the edges
  (the remaining inputs, start_states, end_states, end_state_weights, index and state_buffer, are listed in in_info below)
outputs:
  param output: Baum-Welch alignment, scores in -log space, 3d (time,batch,dim), like am_scores
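As a toy illustration of the documented layout (the automaton and all values here are made up, and this only builds the arrays, it does not invoke the op): edges has one column per edge with rows (from, to, emission_idx, sequence_idx) as read by c_fw_code below, weights holds one -log-space weight per edge, start_states holds one start state per sequence, and each row of end_states is a (sequence_idx, state_idx) pair with a matching weight in end_state_weights:

    import numpy as np

    # a single 3-state left-to-right automaton: states 0 -> 1 -> 2, self-loops on each state,
    # each edge emitting the label of its target state
    edges = np.array([
        [0, 0, 1, 1, 2],   # from
        [0, 1, 1, 2, 2],   # to
        [0, 1, 1, 2, 2],   # emission_idx
        [0, 0, 0, 0, 0],   # sequence_idx (only one sequence here)
    ], dtype="int32")

    weights = np.zeros(edges.shape[1], dtype="float32")       # 0.0 in -log space == weight 1
    start_states = np.array([0], dtype="int32")               # one start state per sequence
    end_states = np.array([[0, 2]], dtype="int32")            # rows: (sequence_idx, state_idx)
    end_state_weights = np.zeros(len(end_states), dtype="float32")   # one weight per end state

    # am_scores would then be (time, batch=1, 3) in -log space, index (time, 1) with 1.0 for
    # valid frames and 0.0 afterwards, and state_buffer a (2, n_states) float scratch buffer,
    # matching the in_info entries below.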
in_info = ({'ndim': 3, 'gradient': 'disconnected', 'shape': (None, None, None), 'need_contiguous': True, 'name': 'am_scores'}, {'ndim': 2, 'name': 'edges', 'gradient': 'disconnected', 'dtype': 'int32', 'need_contiguous': True, 'shape': (None, None)}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (None,), 'need_contiguous': True, 'name': 'weights'}, {'ndim': 1, 'name': 'start_states', 'gradient': 'disconnected', 'dtype': 'int32', 'need_contiguous': True, 'shape': None}, {'ndim': 2, 'name': 'end_states', 'gradient': 'disconnected', 'dtype': 'int32', 'need_contiguous': True, 'shape': (None, 2)}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (4, 0), 'need_contiguous': True, 'name': 'end_state_weights'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'index'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (2, None), 'need_contiguous': True, 'name': 'state_buffer'})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'need_contiguous': True, 'name': 'output'}, {'ndim': 2, 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'sums'})[source]
c_extra_support_code = {'100_init_bwd_state_buffer': '\n __global__\n void init_bwd_state_buffer(float* states, unsigned* end_states, float* end_state_weigths, unsigned t, unsigned max_t, float* index, unsigned index_stride) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n unsigned seq_idx = end_states[idx * 2u + 0u];\n if (index[t * index_stride + seq_idx] == 1.0 && (t == max_t || index[(t + 1) * index_stride + seq_idx] == 0.0)) {\n unsigned state_idx = end_states[idx * 2u + 1u];\n float weight = end_state_weights[idx];\n states[state_idx] = weight;\n }\n }\n ', '001_set_start_states': '\n __global__\n void set_start_states(float* states, unsigned* start_states) {\n unsigned state_idx = start_states[blockIdx.x * blockDim.x + threadIdx.x];\n states[state_idx] = 0.0;\n }\n ', '110_write_alignment_to_file': '\n void write_alignment_to_file(float* d_state_buffer, float* d_index, unsigned index_stride,\n unsigned* d_start_states, unsigned* d_end_states,\n float pruning, unsigned n_frames, unsigned n_seqs, unsigned n_states, unsigned batch_idx) {\n std::vector<float> state_buffer((n_frames + 1u) * n_states);\n std::vector<float> index (n_frames * index_stride);\n std::vector<unsigned> start_states(n_seqs);\n std::vector<unsigned> end_states (n_seqs);\n\n HANDLE_ERROR(cudaMemcpy(state_buffer.data(), d_state_buffer, state_buffer.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(index.data(), d_index, index.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(start_states.data(), d_start_states, start_states.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(end_states.data(), d_end_states, end_states.size() * sizeof(float), cudaMemcpyDeviceToHost));\n\n for (unsigned seq = 0u; seq < n_seqs; seq++) {\n std::stringstream filename;\n filename << "alignment.dump." 
<< batch_idx << \'.\' << seq;\n std::ofstream out(filename.str().c_str(), std::ios::out | std::ios::trunc);\n for (unsigned t = 0u; t <= n_frames; t++) {\n if (t > 0u and index[seq * index_stride + t] <= 0.0) {\n break;\n }\n float sum = std::numeric_limits<float>::infinity();\n for (unsigned s = start_states[seq]; s <= end_states[seq]; s++) {\n const float val = state_buffer[t * n_states + s];\n float diff = val - sum;\n if (!isnan(diff)) {\n sum = -log1p(exp(-abs(diff))) + min(sum, val);\n }\n }\n for (unsigned s = start_states[seq]; s <= end_states[seq]; s++) {\n const float val = state_buffer[t * n_states + s] - sum;\n if (val <= pruning) {\n out << t << \' \' << (s - start_states[seq]) << \' \' << val << \'\\n\';\n }\n }\n }\n }\n }\n ', '013_atomic_prob_add': '\n __device__\n void atomic_prob_add(float* a, float b) {\n int* addr = (int*)a;\n int old = __float_as_int(*a);\n int assumed;\n do {\n assumed = old;\n old = atomicCAS(addr, assumed, __float_as_int(prob_add(__int_as_float(old), b)));\n } while (old != assumed);\n }\n ', '111_write_output_to_file': '\n void write_output_to_file(float* d_out, float* d_index, unsigned index_stride,\n float pruning, unsigned n_frames, unsigned n_seqs, unsigned n_emissions, unsigned batch_idx) {\n std::vector<float> buffer(n_frames * n_seqs * n_emissions);\n std::vector<float> index (n_frames * index_stride);\n\n HANDLE_ERROR(cudaMemcpy(buffer.data(), d_out, buffer.size() * sizeof(float), cudaMemcpyDeviceToHost));\n HANDLE_ERROR(cudaMemcpy(index.data(), d_index, index.size() * sizeof(float), cudaMemcpyDeviceToHost));\n\n for (unsigned seq = 0u; seq < n_seqs; seq++) {\n std::stringstream filename;\n filename << "target.dump." << batch_idx << \'.\' << seq;\n std::ofstream out(filename.str().c_str(), std::ios::out | std::ios::trunc);\n for (unsigned t = 0u; t <= n_frames; t++) {\n if (t > 0u and index[seq * index_stride + t] <= 0.0) {\n break;\n }\n for (unsigned e = 0u; e < n_emissions; e++) {\n const float val = buffer[t * n_seqs * n_emissions + seq * n_emissions + e];\n if (val <= pruning) {\n out << t << \' \' << e << \' \' << val << \'\\n\';\n }\n }\n }\n }\n }\n ', '101_next_frame': '\n __global__\n void next_frame(bool fwd, unsigned num_edges, unsigned num_emissions,\n unsigned* sequence_idxs, unsigned* from_buffer, unsigned* to_buffer, float* weight_buffer, unsigned* emission_idxs,\n float* prev_frame, float* next_frame, float* am_scores, float* edge_buffer) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_edges) {\n return;\n }\n\n unsigned from = from_buffer [idx];\n float prev_val = prev_frame[from];\n if (isinf(prev_val)) {\n edge_buffer[idx] = CUDART_INF_F;\n return;\n }\n\n unsigned to = to_buffer [idx];\n unsigned emission_idx = emission_idxs[idx];\n float edge_weight = weight_buffer[idx];\n unsigned sequence_idx = sequence_idxs[idx];\n\n float val = prev_val + edge_weight + am_scores[sequence_idx * num_emissions + emission_idx];\n\n if (fwd) {\n edge_buffer[idx] += val;\n }\n else {\n edge_buffer[idx] += prev_val;\n }\n atomic_prob_add(next_frame + to, val);\n }\n ', '010_fill_array': '\n __global__\n void fill_array(float* array, float value, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = value;\n }\n }\n ', '020_dump_to_file': "\n template<typename T>\n void dump_to_file_1d(T* d_mem, unsigned n_d1, std::string const& path) {\n std::vector<T> buffer(n_d1);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n 
std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n T val = buffer[i1];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << val << '\\n';\n }\n }\n }\n\n template<typename T>\n void dump_to_file_2d(T* d_mem, unsigned n_d1, unsigned n_d2, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n T val = buffer[i1 * n_d2 + i2];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << val << '\\n';\n }\n }\n }\n }\n\n template<typename T>\n void dump_to_file_3d(T* d_mem, unsigned n_d1, unsigned n_d2, unsigned n_d3, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2 * n_d3);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n for (size_t i3 = 0ul; i3 < n_d3; i3++) {\n T val = buffer[i1 * n_d2 * n_d3 + i2 * n_d3 + i3];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << i3 << ' ' << val << '\\n';\n }\n }\n }\n }\n }\n ", '012_prob_add': '\n __device__\n float prob_add(float a, float b) {\n float diff = a - b;\n if (isnan(diff)) {\n return CUDART_INF_F;\n }\n else {\n return -log1p(exp(-abs(diff))) + min(a, b);\n }\n }\n ', '102_normalize': '\n __global__\n void normalize(float* buffer, unsigned* sequence_idxs, unsigned num_edges, unsigned num_seqs, float* sum_output) {\n extern __shared__ float sum[];\n\n buffer += blockIdx.x * num_edges;\n\n for (unsigned s = 0u; s < num_seqs; s++) {\n sum[s] = CUDART_INF_F;\n }\n\n for (unsigned e = 0u; e < num_edges; e++) {\n unsigned s = sequence_idxs[e];\n sum[s] = prob_add(sum[s], buffer[e]);\n }\n\n for (unsigned s = 0ul; s < num_seqs; s++) {\n if (isinf(sum[s])) {\n // if the frame is empty (happens due to batching of seqs with unequal length), set it to 0\n sum_output[blockIdx.x * num_seqs + s] = 0.0;\n }\n else {\n sum_output[blockIdx.x * num_seqs + s] = sum[s];\n }\n }\n\n for (unsigned e = 0u; e < num_edges; e++) {\n unsigned s = sequence_idxs[e];\n buffer[e] -= sum[s];\n }\n }\n ', '103_compute_result': '\n __global__\n void compute_result(float* edge_buffer, float* out, unsigned* emission_idxs, unsigned* sequence_idxs,\n unsigned frame_stride, unsigned seq_stride,\n unsigned num_frames, unsigned num_seqs, unsigned num_edges) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_frames * num_edges) {\n return;\n }\n\n unsigned e_idx = idx % num_edges;\n unsigned frame = idx / num_edges;\n unsigned emission_idx = emission_idxs[e_idx];\n unsigned seq_idx = sequence_idxs[e_idx];\n float score = edge_buffer[idx];\n\n atomic_prob_add(out + frame * frame_stride + seq_idx * seq_stride + emission_idx, score);\n }\n ', '011_remove_inf': '\n __global__\n void remove_inf(float* array, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = fminf(array[idx], 1e32);\n }\n }\n '}[source]
c_fw_code = '\n // am_scores, edges, weights, start_states, end_states, end_state_weigths index, state_buffer* = input_names (*: inplace)\n // output = output_names\n assert(n_inputs == 8);\n assert(n_outputs == 2);\n Ndarray* am_scores = inputs[0];\n Ndarray* edges = inputs[1];\n Ndarray* weights = inputs[2];\n Ndarray* start_states = inputs[3];\n Ndarray* end_states = inputs[4];\n Ndarray* end_state_weights = inputs[5];\n Ndarray* index = inputs[6];\n Ndarray* state_buffer = inputs[7];\n Ndarray* out = *outputs[0];\n Ndarray* sum_output = *outputs[1];\n\n assert(Ndarray_DIMS(am_scores)[0] == Ndarray_DIMS(out)[0]);\n assert(Ndarray_DIMS(am_scores)[1] == Ndarray_DIMS(out)[1]);\n assert(Ndarray_DIMS(am_scores)[2] == Ndarray_DIMS(out)[2]);\n assert(Ndarray_DIMS(am_scores)[1] == Ndarray_DIMS(start_end_states)[1]);\n\n assert(Ndarray_DIMS(sum_output)[0] == Ndarray_DIMS(am_scores)[0]);\n assert(Ndarray_DIMS(sum_output)[1] == Ndarray_DIMS(am_scores)[1]);\n\n bool dump_alignment = false;\n bool dump_output = false;\n unsigned dump_every = 40u;\n static unsigned batch_idx = 0u;\n float pruning = 10.f;\n\n unsigned* d_from = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 0 * Ndarray_STRIDE(edges, 0));\n unsigned* d_to = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 1 * Ndarray_STRIDE(edges, 0));\n unsigned* d_emission_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 2 * Ndarray_STRIDE(edges, 0));\n unsigned* d_sequence_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(edges) + 3 * Ndarray_STRIDE(edges, 0));\n float* d_weights = Ndarray_DEV_DATA(weights);\n float* d_am_scores = Ndarray_DEV_DATA(am_scores);\n unsigned* d_start_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(start_states));\n unsigned* d_end_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA_int32(end_states));\n float* d_end_state_weigths = Ndarray_DEV_DATA(end_state_weights);\n float* d_index = Ndarray_DEV_DATA(index);\n float* d_state_buffer_prev = Ndarray_DEV_DATA(state_buffer) + 0 * Ndarray_STRIDE(state_buffer, 0);\n float* d_state_buffer_next = Ndarray_DEV_DATA(state_buffer) + 1 * Ndarray_STRIDE(state_buffer, 0);\n float* d_out = Ndarray_DEV_DATA(out);\n float* d_sum_output = Ndarray_DEV_DATA(sum_output);\n\n unsigned n_frames = Ndarray_DIMS(am_scores)[0];\n unsigned n_seqs = Ndarray_DIMS(am_scores)[1];\n unsigned n_emissions = Ndarray_DIMS(am_scores)[2];\n unsigned n_states = Ndarray_DIMS(state_buffer)[1];\n unsigned n_edges = Ndarray_DIMS(edges)[1];\n unsigned n_end_states = Ndarray_DIMS(end_states)[0];\n unsigned n_threads = 1024u;\n unsigned n_blocks = (n_edges + n_threads - 1) / n_threads;\n\n unsigned frame_stride = Ndarray_STRIDE(am_scores, 0);\n unsigned sequence_stride = Ndarray_STRIDE(am_scores, 1);\n unsigned index_stride = Ndarray_STRIDE(index, 0);\n\n assert(n_frames > 0);\n\n //std::cerr << "n_frames: " << n_frames << std::endl;\n //std::cerr << "n_seqs: " << n_seqs << std::endl;\n //std::cerr << "n_emissions: " << n_emissions << std::endl;\n //std::cerr << "n_states: " << n_states << std::endl;\n //std::cerr << "n_edges: " << n_edges << std::endl;\n //std::cerr << "n_end_states: " << n_end_states << std::endl;\n //std::cerr << "n_threads: " << n_threads << std::endl;\n //std::cerr << "n_blocks: " << n_blocks << std::endl;\n\n //std::cerr << "frame_stride: " << frame_stride << std::endl;\n //std::cerr << "sequnence_stride: " << sequence_stride << std::endl;\n //std::cerr << "index_stride: " << index_stride << std::endl;\n\n // initialize edge 
buffer\n float* d_edge_buffer = reinterpret_cast<float*>(device_malloc(n_edges * n_frames * sizeof(float)));\n unsigned n_fill_blocks = (n_edges * n_frames + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_edge_buffer, 0.0, n_edges * n_frames);\n HANDLE_LAST_ERROR();\n\n // initialize the state buffer\n n_fill_blocks = (n_states + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_prev, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n set_start_states<<<1, n_seqs>>>(d_state_buffer_prev, d_start_states);\n\n // initialize full state buffer (only used to dump the alignment)\n float* d_state_buffer_all = NULL;\n if (dump_alignment and batch_idx %% dump_every == 0) {\n d_state_buffer_all = reinterpret_cast<float*>(device_malloc(n_states * (n_frames + 1u) * sizeof(float)));\n cudaMemcpy(d_state_buffer_all, d_state_buffer_prev, n_states * sizeof(float), cudaMemcpyDeviceToDevice);\n }\n\n // fwd pass\n for (unsigned t = 0u; t < n_frames; t++) {\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_next, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n next_frame<<<n_blocks, n_threads>>>(true, n_edges, sequence_stride,\n d_sequence_idxs, d_from, d_to, d_weights, d_emission_idxs,\n d_state_buffer_prev, d_state_buffer_next, d_am_scores + t * frame_stride, d_edge_buffer + t * n_edges);\n HANDLE_LAST_ERROR();\n if (dump_alignment and batch_idx %% dump_every == 0) {\n cudaMemcpy(d_state_buffer_all + (t + 1u) * n_states, d_state_buffer_next, n_states * sizeof(float), cudaMemcpyDeviceToDevice);\n }\n std::swap(d_state_buffer_prev, d_state_buffer_next);\n }\n\n // bwd pass\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_prev, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n for (unsigned t = n_frames; t > 0; t--) {\n init_bwd_state_buffer<<<1, n_end_states>>>(d_state_buffer_prev, d_end_states, d_end_state_weigths, t - 1, n_frames - 1, d_index, index_stride);\n HANDLE_LAST_ERROR();\n if (dump_alignment and batch_idx %% dump_every == 0) {\n float alpha = 1.0f;\n HANDLE_ERROR(cublasSaxpy(handle, n_states, &alpha, d_state_buffer_prev, 1, d_state_buffer_all + t * n_states, 1));\n }\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_next, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n next_frame<<<n_blocks, n_threads>>>(false, n_edges, sequence_stride,\n d_sequence_idxs, d_to, d_from, d_weights, d_emission_idxs,\n d_state_buffer_prev, d_state_buffer_next, d_am_scores + (t - 1) * frame_stride, d_edge_buffer + (t - 1) * n_edges);\n HANDLE_LAST_ERROR();\n std::swap(d_state_buffer_prev, d_state_buffer_next);\n }\n if (dump_alignment and batch_idx %% dump_every == 0) {\n float alpha = 1.0f;\n HANDLE_ERROR(cublasSaxpy(handle, n_states, &alpha, d_state_buffer_prev, 1, d_state_buffer_all, 1));\n }\n\n // normalize at each time frame\n normalize<<<n_frames, 1, n_seqs * sizeof(float)>>>(d_edge_buffer, d_sequence_idxs, n_edges, n_seqs, d_sum_output);\n HANDLE_LAST_ERROR();\n\n // dump alignment\n if (dump_alignment and batch_idx %% dump_every == 0) {\n write_alignment_to_file(d_state_buffer_all, d_index, index_stride, d_start_states, d_end_states,\n pruning, n_frames, n_seqs, n_states, batch_idx);\n }\n\n n_fill_blocks = (n_frames * n_seqs * n_emissions + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_out, std::numeric_limits<float>::infinity(), n_frames * n_seqs * n_emissions);\n HANDLE_LAST_ERROR();\n\n 
frame_stride = Ndarray_STRIDE(out, 0);\n sequence_stride = Ndarray_STRIDE(out, 1);\n n_blocks = (n_frames * n_edges + n_threads - 1u) / n_threads;\n compute_result<<<n_blocks, n_threads>>>(d_edge_buffer, d_out, d_emission_idxs, d_sequence_idxs,\n frame_stride, sequence_stride, n_frames, n_seqs, n_edges);\n HANDLE_LAST_ERROR();\n\n #if TENSORFLOW\n // Certain TensorFlow code doesn\'t like inf, even if it is just the CheckNumerics,\n // which is helpful for debugging.\n // We replace it by a very high number, so that tf.exp(-out) will still result in 0.0.\n n_blocks = (n_frames * n_seqs * n_emissions + n_threads - 1u) / n_threads;\n remove_inf<<<n_blocks, n_threads>>>(d_out, n_frames * n_seqs * n_emissions);\n //debug_print(context, out, "out");\n #endif\n if (dump_output and batch_idx %% dump_every == 0) {\n write_output_to_file(d_out, d_index, index_stride, pruning, n_frames, n_seqs, n_emissions, batch_idx);\n }\n\n device_free(d_edge_buffer);\n if (d_state_buffer_all != NULL) {\n device_free(d_state_buffer_all);\n }\n batch_idx++;\n '[source]
c_bw_code = None[source]
cpu_support = False[source]
class NativeOp.SegmentFastBaumWelchOp(segmentwise_normalization=False, dump_targets_interval=None)[source]
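The segmentwise_normalization flag controls how the normalization factors are computed: when it is set, the per-(segment length, batch) sums produced by compute_framewise_sum are merged over the segment-length axis by the merge_framewise_sum kernel (see c_extra_support_code and the if (segmentwise_normalization) branch in c_fw_code below), so every valid segment length of a batch entry is normalized by one shared total; when it is left at False, that merge is skipped and each cell keeps its own framewise factor. A NumPy re-expression of the merge, for illustration only (prob_add is the same -log-space addition used by the kernels, not part of the module):

    import numpy as np

    def prob_add(a, b):
        # -log(exp(-a) + exp(-b))
        return -np.logaddexp(-a, -b)

    def merge_framewise_sum(sum_buffer, index):
        # sum_buffer, index: arrays of shape (num_seg_frames, num_batches);
        # index != 0 marks the valid (segment length, batch) cells
        merged = sum_buffer.copy()
        for b in range(sum_buffer.shape[1]):
            total = sum_buffer[0, b]
            for s in range(1, sum_buffer.shape[0]):
                if index[s, b] != 0.0:
                    total = prob_add(total, sum_buffer[s, b])
            for s in range(sum_buffer.shape[0]):
                if index[s, b] != 0.0:
                    merged[s, b] = total
        return merged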
in_info = ({'ndim': 3, 'gradient': 'disconnected', 'shape': (None, None, None), 'need_contiguous': True, 'name': 'am_scores'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 'batch_idxs'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, None), 'need_contiguous': True, 'name': 'edges'}, {'ndim': 1, 'gradient': 'disconnected', 'shape': ((2, 1),), 'need_contiguous': True, 'name': 'weights'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (None, (0, 0)), 'need_contiguous': True, 'name': 'length_models'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': (2, None), 'need_contiguous': True, 'name': 'start_end_states'}, {'ndim': 2, 'gradient': 'disconnected', 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'index'}, {'ndim': 1, 'gradient': 'disconnected', 'shape': (None,), 'need_contiguous': True, 'name': 'am_score_scales'}, {'ndim': 0, 'gradient': 'disconnected', 'shape': (), 'need_contiguous': True, 'name': 'epoch'})[source]
out_info = ({'ndim': 3, 'shape': ((0, 0), (0, 1), (0, 2)), 'need_contiguous': True, 'name': 'output'}, {'ndim': 2, 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'normalization_factors'}, {'ndim': 2, 'shape': ((0, 0), (0, 1)), 'need_contiguous': True, 'name': 'posterior_weigths'})[source]
c_extra_support_code = {'100_init_bwd_state_buffer': '\n __global__\n void init_bwd_state_buffer(unsigned t, unsigned num_batches, unsigned num_seqs,\n int* batch_idxs, float* index, float* states, unsigned* end_states) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n int batch_idx = batch_idxs[t * num_seqs + idx];\n if (batch_idx < 0) {\n return;\n }\n float* batch_first_frame = index + batch_idx;\n //if (*batch_first_frame != 0.0 && (t == max_t || *(batch_first_frame + 1) == 0.0)) {\n if (batch_first_frame[0] != 0.0 && batch_first_frame[num_batches] == 0.0) {\n unsigned state_idx = end_states[idx];\n states[state_idx] = 0.0;\n }\n }\n ', '101_next_frame_fwd': '\n __global__\n void next_frame_fwd(unsigned time, unsigned num_states, unsigned num_edges, unsigned num_emissions, unsigned num_seg_frames,\n unsigned num_tot_frames, unsigned num_seqs, unsigned num_am_score_scales,\n unsigned const* sequence_idxs, unsigned const* from_buffer, unsigned const* to_buffer, float const* weight_buffer,\n unsigned const* emission_idxs, unsigned const* lenmod_idxs, int const* batch_idxs,\n float const* am_scores, float const* length_models, float const* am_score_scales, float const* epoch,\n float* state_buffer, float* edge_buffer) {\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_edges) {\n return;\n }\n\n const unsigned num_ringbuffer_frames = num_seg_frames + 1;\n const unsigned max_seg_frames = min(num_seg_frames, num_tot_frames - time);\n\n const unsigned prev_frame_idx = time % num_ringbuffer_frames;\n const unsigned prev_frame_start = prev_frame_idx * num_states;\n\n const unsigned from = from_buffer [idx];\n const float prev_val = state_buffer[prev_frame_start + from];\n if (isinf(prev_val)) {\n return;\n }\n\n const unsigned sequence_idx = sequence_idxs[idx];\n const int batch_idx = batch_idxs[time * num_seqs + sequence_idx];\n if (batch_idx == -1) {\n return;\n }\n\n const unsigned amss_idx = min(static_cast<unsigned>(*epoch), num_am_score_scales - 1);\n const float am_score_scale = am_score_scales[amss_idx];\n\n const unsigned to = to_buffer [idx];\n const unsigned emission_idx = emission_idxs[idx];\n const unsigned lenmod_idx = lenmod_idxs [idx];\n const float edge_weight = weight_buffer[idx];\n const float prev_plus_edge = prev_val + edge_weight;\n\n float const* am_buffer_in = am_scores + batch_idx * num_seg_frames * num_emissions + emission_idx;\n float const* length_scores = length_models + lenmod_idx * num_seg_frames;\n float* edge_buffer_out = edge_buffer + idx;\n\n for (unsigned i = 0u; i < max_seg_frames; i++) {\n const float val = prev_plus_edge + am_score_scale * am_buffer_in[i * num_emissions] + length_scores[i];\n edge_buffer_out[i * num_edges] = val;\n const unsigned next_frame = (prev_frame_idx + 1 + i) % num_ringbuffer_frames;\n atomic_prob_add(state_buffer + (next_frame * num_states + to), val);\n }\n }\n ', '001_set_start_states': '\n __global__\n void set_start_states(float* states, unsigned* start_states) {\n unsigned state_idx = start_states[blockIdx.x * blockDim.x + threadIdx.x];\n states[state_idx] = 0.0;\n }\n ', '010_fill_array': '\n __global__\n void fill_array(float* array, float value, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = value;\n }\n }\n ', '103_compute_framewise_sum': '\n __global__\n void compute_framewise_sum(unsigned num_tot_frames, unsigned num_seqs, unsigned num_seg_frames, unsigned num_batches, unsigned num_edges,\n unsigned const* sequence_idxs, 
int const* batch_idxs, float const* index, float const* edge_buffer,\n float* output_buffer) {\n extern __shared__ float sum[];\n\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_tot_frames * num_seg_frames) {\n return;\n }\n\n float* sum_buffer = sum + threadIdx.x * num_seqs;\n edge_buffer += idx * num_edges;\n\n for (unsigned s = 0u; s < num_seqs; s++) {\n sum_buffer[s] = CUDART_INF_F;\n }\n\n for (unsigned i = 0; i < num_edges; i++) {\n const unsigned seq_idx = sequence_idxs[i];\n sum_buffer[seq_idx] = prob_add(sum_buffer[seq_idx], edge_buffer[i]);\n }\n\n const unsigned time = idx / num_seg_frames;\n const unsigned seg_size = idx % num_seg_frames;\n for (unsigned s = 0u; s < num_seqs; s++) {\n const int batch_idx = batch_idxs[time * num_seqs + s];\n if (batch_idx >= 0) {\n const unsigned output_idx = seg_size * num_batches + batch_idx;\n if (isinf(sum_buffer[s]) or index[output_idx] == 0.0) {\n output_buffer[output_idx] = 0.0;\n }\n else {\n output_buffer[output_idx] = sum_buffer[s];\n }\n }\n }\n }\n ', '102_next_frame_bwd': '\n __global__\n void next_frame_bwd(unsigned time, unsigned num_states, unsigned num_edges, unsigned num_emissions, unsigned num_seg_frames,\n unsigned num_tot_frames, unsigned num_seqs, unsigned num_am_score_scales,\n unsigned const* sequence_idxs, unsigned const* from_buffer, unsigned const* to_buffer, float const* weight_buffer,\n unsigned const* emission_idxs, unsigned const* lenmod_idxs, int const* batch_idxs,\n float const* am_scores, float const* length_models, float const* am_score_scales, float const* epoch,\n float* state_buffer, float* edge_buffer) {\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_edges) {\n return;\n }\n\n const unsigned num_ringbuffer_frames = num_seg_frames + 1;\n const unsigned max_seg_frames = min(num_seg_frames, num_tot_frames - time);\n\n const unsigned sequence_idx = sequence_idxs[idx];\n const int batch_idx = batch_idxs[time * num_seqs + sequence_idx];\n if (batch_idx == -1) {\n return;\n }\n\n const unsigned amss_idx = min(static_cast<unsigned>(*epoch), num_am_score_scales - 1);\n const float am_score_scale = am_score_scales[amss_idx];\n\n const unsigned from = from_buffer [idx];\n const unsigned to = to_buffer [idx];\n const unsigned emission_idx = emission_idxs[idx];\n const unsigned lenmod_idx = lenmod_idxs [idx];\n const float edge_weight = weight_buffer[idx];\n const unsigned next_frame_idx = time % num_ringbuffer_frames;\n\n float const* am_buffer_in = am_scores + batch_idx * num_seg_frames * num_emissions + emission_idx;\n float const* length_scores = length_models + lenmod_idx * num_seg_frames;\n float* edge_buffer_out = edge_buffer + idx;\n\n float acc_val = CUDART_INF_F;\n\n for (unsigned i = 0u; i < max_seg_frames; i++) {\n const unsigned prev_frame_idx = (next_frame_idx + i + 1) % num_ringbuffer_frames;\n const float prev_val = state_buffer[prev_frame_idx * num_states + from];\n if (isinf(prev_val)) {\n edge_buffer_out[i * num_edges] = CUDART_INF_F;\n }\n else {\n const float val = prev_val + edge_weight + am_score_scale * am_buffer_in[i * num_emissions] + length_scores[i];\n edge_buffer_out[i * num_edges] += prev_val;\n acc_val = prob_add(acc_val, val);\n }\n }\n\n atomic_prob_add(state_buffer + next_frame_idx * num_states + to, acc_val);\n }\n ', '013_atomic_prob_add': '\n __device__\n void atomic_prob_add(float* a, float b) {\n int* addr = (int*)a;\n int old = __float_as_int(*a);\n int assumed;\n do {\n assumed = old;\n old = atomicCAS(addr, assumed, 
__float_as_int(prob_add(__int_as_float(old), b)));\n } while (old != assumed);\n }\n ', '020_dump_to_file': "\n template<typename T>\n void dump_to_file_1d(T* d_mem, unsigned n_d1, std::string const& path) {\n std::vector<T> buffer(n_d1);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n T val = buffer[i1];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << val << '\\n';\n }\n }\n }\n\n template<typename T>\n void dump_to_file_2d(T* d_mem, unsigned n_d1, unsigned n_d2, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n T val = buffer[i1 * n_d2 + i2];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << val << '\\n';\n }\n }\n }\n }\n\n template<typename T>\n void dump_to_file_3d(T* d_mem, unsigned n_d1, unsigned n_d2, unsigned n_d3, std::string const& path) {\n std::vector<T> buffer(n_d1 * n_d2 * n_d3);\n cudaMemcpy(buffer.data(), d_mem, buffer.size() * sizeof(T), cudaMemcpyDeviceToHost);\n\n std::ofstream output(path.c_str(), std::ios::trunc | std::ios::out);\n for (size_t i1 = 0ul; i1 < n_d1; i1++) {\n for (size_t i2 = 0ul; i2 < n_d2; i2++) {\n for (size_t i3 = 0ul; i3 < n_d3; i3++) {\n T val = buffer[i1 * n_d2 * n_d3 + i2 * n_d3 + i3];\n if (!std::numeric_limits<T>::has_infinity or !std::isinf(val)) {\n output << i1 << ' ' << i2 << ' ' << i3 << ' ' << val << '\\n';\n }\n }\n }\n }\n }\n ", '012_prob_add': '\n __device__\n float prob_add(float a, float b) {\n float diff = a - b;\n if (isnan(diff)) {\n return CUDART_INF_F;\n }\n else {\n return -log1p(exp(-abs(diff))) + min(a, b);\n }\n }\n ', '106_compute_posterior_weights': '\n __global__\n void compute_posterior_weights(unsigned num_tot_frames, unsigned num_seg_frames, unsigned num_seqs, unsigned num_batches,\n float const* state_buffer, unsigned const* start_states, int const* batch_idxs,\n float const* index, float const* normalization_factors, float* posterior_weigths) {\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_tot_frames * num_seqs) {\n return;\n }\n\n const unsigned time = idx / num_seqs;\n const unsigned seq_idx = idx % num_seqs;\n\n const int batch_idx = batch_idxs[time * num_seqs + seq_idx];\n if (batch_idx < 0) {\n return;\n }\n\n const float seq_sum = state_buffer[start_states[seq_idx]];\n for (unsigned s = 0u; s < num_seg_frames; s++) {\n const unsigned i = s * num_batches + batch_idx;\n if (index[i] == 0.0) {\n return;\n }\n posterior_weigths[i] = exp(-(normalization_factors[i] - seq_sum));\n }\n }\n ', '104_merge_framewise_sums': '\n __global__\n void merge_framewise_sum(unsigned num_seg_frames, unsigned num_batches, float const* index, float* sum_buffer) {\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_batches) {\n return;\n }\n\n sum_buffer += idx;\n index += idx;\n\n float sum = sum_buffer[0];\n for (unsigned s = 1; s < num_seg_frames; s++) {\n if (index[s * num_batches] != 0.0f) {\n sum = prob_add(sum, sum_buffer[s * num_batches]);\n }\n }\n\n for (unsigned s = 0; s < num_seg_frames; s++) {\n if (index[s * num_batches] != 0.0f) {\n sum_buffer[s * 
num_batches] = sum;\n }\n }\n }\n ', '105_compute_targets': '\n __global__\n void compute_targets(unsigned num_tot_frames, unsigned num_seg_frames, unsigned num_edges, unsigned num_batches, unsigned num_seqs, unsigned num_emissions,\n unsigned const* sequence_idxs, unsigned const* emission_idxs, int const* batch_idxs, float const* index,\n float const* edge_buffer, float const* normalization_buffer, float* output_buffer) {\n const unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx >= num_tot_frames * num_seg_frames * num_edges) {\n return;\n }\n\n const unsigned edge_idx = idx % num_edges;\n const unsigned time = idx / (num_edges * num_seg_frames);\n const unsigned seq_idx = sequence_idxs[edge_idx];\n const int batch_idx = batch_idxs[time * num_seqs + seq_idx];\n\n if (batch_idx < 0) {\n return;\n }\n\n const unsigned seg_length = (idx / num_edges) % num_seg_frames;\n\n if (index[seg_length * num_batches + batch_idx] == 0.0) {\n return;\n }\n\n const unsigned emission_idx = emission_idxs[edge_idx];\n const float normalization = normalization_buffer[seg_length * num_batches + batch_idx];\n\n atomic_prob_add(output_buffer + seg_length * num_batches * num_emissions + batch_idx * num_emissions + emission_idx, edge_buffer[idx] - normalization);\n }\n ', '011_remove_inf': '\n __global__\n void remove_inf(float* array, unsigned size) {\n unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < size) {\n array[idx] = fminf(array[idx], 1e32);\n }\n }\n '}[source]
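The segmental kernels above keep only num_seg_frames + 1 state frames alive in a ring buffer (num_ringbuffer_frames in next_frame_fwd/next_frame_bwd): at time t the forward kernel reads slot t % R and writes the slots (t + 1 + i) % R for the up to num_seg_frames segment lengths, while the host loop in c_fw_code below clears slot (t - 1) % R right before processing frame t (for t >= 1), which is exactly the slot that will receive the write for the longest segment starting at t. A small sanity check of that slot arithmetic, for illustration only:

    num_seg_frames = 3
    R = num_seg_frames + 1                        # num_ringbuffer_frames in the kernels

    for t in range(1, 10):
        read_slot = t % R                         # forward scores of the current frame
        write_slots = [(t + 1 + i) % R for i in range(num_seg_frames)]
        recycled = (t - 1) % R                    # cleared by the host loop before frame t
        assert read_slot not in write_slots       # the frame being read is never overwritten
        assert recycled == (t + num_seg_frames) % R   # the cleared slot is the farthest write target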
cpu_support = False[source]
c_fw_code = '\n // inputs: am_scores, batch_idxs, edges, weights, length_models, start_end_states, index, am_score_scales, epoch\n // outputs: output, normalization_factors, posterior_weigths\n assert(n_inputs == 9);\n assert(n_outputs == 3);\n Ndarray* ary_am_scores = inputs[0];\n Ndarray* ary_batch_idxs = inputs[1];\n Ndarray* ary_edges = inputs[2];\n Ndarray* ary_weights = inputs[3];\n Ndarray* ary_start_end_states = inputs[4];\n Ndarray* ary_length_models = inputs[5];\n Ndarray* ary_index = inputs[6];\n Ndarray* ary_am_score_scales = inputs[7];\n Ndarray* ary_epoch = inputs[8];\n Ndarray* ary_out = *outputs[0];\n Ndarray* ary_norm_factors = *outputs[1];\n Ndarray* ary_posterior_weights = *outputs[2];\n\n assert(Ndarray_DIMS(ary_edges)[1] == Ndarray_DIMS(ary_weights)[0]);\n\n static unsigned iter = 0u; // used for debug output\n\n float* d_am_scores = Ndarray_DEV_DATA(ary_am_scores);\n int* d_batch_idxs = reinterpret_cast<int*>(Ndarray_DEV_DATA(ary_batch_idxs));\n unsigned* d_from = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_edges) + 0 * Ndarray_STRIDE(ary_edges, 0));\n unsigned* d_to = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_edges) + 1 * Ndarray_STRIDE(ary_edges, 0));\n unsigned* d_emission_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_edges) + 2 * Ndarray_STRIDE(ary_edges, 0));\n unsigned* d_lenmod_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_edges) + 3 * Ndarray_STRIDE(ary_edges, 0));\n unsigned* d_sequence_idxs = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_edges) + 4 * Ndarray_STRIDE(ary_edges, 0));\n float* d_weights = Ndarray_DEV_DATA(ary_weights);\n float* d_length_models = Ndarray_DEV_DATA(ary_length_models);\n unsigned* d_start_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_start_end_states) + 0 * Ndarray_STRIDE(ary_start_end_states, 0));\n unsigned* d_end_states = reinterpret_cast<unsigned*>(Ndarray_DEV_DATA(ary_start_end_states) + 1 * Ndarray_STRIDE(ary_start_end_states, 0));\n float* d_index = Ndarray_DEV_DATA(ary_index);\n float* d_am_score_scales = Ndarray_DEV_DATA(ary_am_score_scales);\n float* d_epoch = Ndarray_DEV_DATA(ary_epoch);\n float* d_out = Ndarray_DEV_DATA(ary_out);\n float* d_norm_factors = Ndarray_DEV_DATA(ary_norm_factors);\n float* d_posterior_weights = Ndarray_DEV_DATA(ary_posterior_weights);\n\n const unsigned n_seg_frames = Ndarray_DIMS(ary_am_scores)[0];\n const unsigned n_batches = Ndarray_DIMS(ary_am_scores)[1];\n const unsigned n_emissions = Ndarray_DIMS(ary_am_scores)[2];\n const unsigned n_tot_frames = Ndarray_DIMS(ary_batch_idxs)[0];\n const unsigned n_seqs = Ndarray_DIMS(ary_batch_idxs)[1];\n const unsigned n_edges = Ndarray_DIMS(ary_edges)[1];\n const unsigned n_length_models = Ndarray_DIMS(ary_length_models)[1];\n const unsigned n_am_score_scales = Ndarray_DIMS(ary_am_score_scales)[0];\n const unsigned n_threads = 1024u;\n unsigned n_blocks = (n_edges + n_threads - 1) / n_threads;\n\n unsigned tmp;\n HANDLE_ERROR(cudaMemcpy(&tmp, d_end_states + n_seqs - 1, sizeof(float), cudaMemcpyDeviceToHost));\n\n const unsigned n_states = tmp + 1;\n\n /*std::cerr << "seg frames: " << n_seg_frames << std::endl;\n std::cerr << "batches: " << n_batches << std::endl;\n std::cerr << "emissions: " << n_emissions << std::endl;\n std::cerr << "tot frames: " << n_tot_frames << std::endl;\n std::cerr << "seqs: " << n_seqs << std::endl;\n std::cerr << "edges: " << n_edges << std::endl;\n std::cerr << "length models: " << n_length_models << std::endl;\n std::cerr << "threads: " << n_threads << std::endl;\n std::cerr << 
"blocks: " << n_blocks << std::endl;\n std::cerr << "num states: " << n_states << std::endl;*/\n\n // initialize edge buffer\n const unsigned edge_buffer_size = n_tot_frames * n_seg_frames * n_edges;\n float* d_edge_buffer = reinterpret_cast<float*>(device_malloc(edge_buffer_size * sizeof(float)));\n HANDLE_LAST_ERROR();\n unsigned n_fill_blocks = (edge_buffer_size + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_edge_buffer, std::numeric_limits<float>::infinity(), edge_buffer_size);\n HANDLE_LAST_ERROR();\n\n // initialize the state buffer\n const unsigned n_ringbuffer_frames = n_seg_frames + 1;\n float* d_state_buffer = reinterpret_cast<float*>(device_malloc(n_states * n_ringbuffer_frames * sizeof(float)));\n HANDLE_LAST_ERROR();\n n_fill_blocks = (n_states * n_ringbuffer_frames + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer, std::numeric_limits<float>::infinity(), n_states * n_ringbuffer_frames);\n HANDLE_LAST_ERROR();\n\n // initialize sum buffer and posterior weigths\n n_fill_blocks = (n_batches * n_seg_frames + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_norm_factors, 0.0f, n_batches * n_seg_frames);\n HANDLE_LAST_ERROR();\n fill_array<<<n_fill_blocks, n_threads>>>(d_posterior_weights, 0.0f, n_batches * n_seg_frames);\n HANDLE_LAST_ERROR();\n\n set_start_states<<<1, n_seqs>>>(d_state_buffer, d_start_states);\n HANDLE_LAST_ERROR();\n\n // fwd pass\n for (unsigned t = 0u; t < n_tot_frames; t++) {\n //std::cerr << "fwd t: " << t << " " << n_tot_frames << std::endl;\n float* d_state_buffer_prev = d_state_buffer + ((t - 1) %% n_ringbuffer_frames) * n_states;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_prev, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n next_frame_fwd<<<n_blocks, n_threads>>>(t, n_states, n_edges, n_emissions, n_seg_frames, n_tot_frames, n_seqs, n_am_score_scales,\n d_sequence_idxs, d_from, d_to, d_weights, d_emission_idxs, d_lenmod_idxs, d_batch_idxs,\n d_am_scores, d_length_models, d_am_score_scales, d_epoch,\n d_state_buffer, d_edge_buffer + t * n_seg_frames * n_edges);\n HANDLE_LAST_ERROR();\n\n //std::stringstream ss;\n //ss << "dump/fwd_state_buffer." 
<< t << ".dump";\n //dump_to_file_2d(d_state_buffer, n_ringbuffer_frames, n_states, ss.str());\n }\n\n //dump_to_file_3d(d_edge_buffer, n_tot_frames, n_seg_frames, n_edges, "dump/fwd_edges.dump");\n\n // bwd pass\n n_fill_blocks = (n_states * n_ringbuffer_frames + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer, std::numeric_limits<float>::infinity(), n_states * n_ringbuffer_frames);\n HANDLE_LAST_ERROR();\n n_fill_blocks = (n_states + n_threads - 1u) / n_threads;\n for (unsigned t = n_tot_frames; t > 0; t--) {\n //std::cerr << "bwd t: " << t << " " << n_tot_frames << " buffer next: " << ((t-1) %% n_ringbuffer_frames) << std::endl;\n float* d_state_buffer_next = d_state_buffer + ((t - 1) %% n_ringbuffer_frames) * n_states;\n float* d_state_buffer_prev = d_state_buffer + ( t %% n_ringbuffer_frames) * n_states;\n fill_array<<<n_fill_blocks, n_threads>>>(d_state_buffer_next, std::numeric_limits<float>::infinity(), n_states);\n HANDLE_LAST_ERROR();\n init_bwd_state_buffer<<<1, n_seqs>>>(t - 1, n_batches, n_seqs, d_batch_idxs, d_index, d_state_buffer_prev, d_end_states);\n HANDLE_LAST_ERROR();\n next_frame_bwd<<<n_blocks, n_threads>>>(t - 1, n_states, n_edges, n_emissions, n_seg_frames, n_tot_frames, n_seqs, n_am_score_scales,\n d_sequence_idxs, d_to, d_from, d_weights, d_emission_idxs, d_lenmod_idxs, d_batch_idxs,\n d_am_scores, d_length_models, d_am_score_scales, d_epoch,\n d_state_buffer, d_edge_buffer + (t - 1) * n_seg_frames * n_edges);\n HANDLE_LAST_ERROR();\n\n //std::stringstream ss;\n //ss << "dump/bwd_state_buffer." << t << ".dump";\n //dump_to_file_2d(d_state_buffer, n_ringbuffer_frames, n_states, ss.str());\n }\n\n n_blocks = (n_tot_frames * n_seg_frames + n_threads - 1) / n_threads;\n compute_framewise_sum<<<n_blocks, n_threads, n_threads * n_seqs * sizeof(float)>>>(n_tot_frames, n_seqs, n_seg_frames, n_batches, n_edges,\n d_sequence_idxs, d_batch_idxs,\n d_index, d_edge_buffer, d_norm_factors);\n HANDLE_LAST_ERROR();\n\n //dump_to_file_2d(d_norm_factors, n_seg_frames, n_batches, "dump/norm_factors_1.dump");\n\n if (segmentwise_normalization) {\n n_blocks = (n_batches + n_threads - 1) / n_threads;\n merge_framewise_sum<<<n_blocks, n_threads>>>(n_seg_frames, n_batches, d_index, d_norm_factors);\n HANDLE_LAST_ERROR();\n }\n\n //dump_to_file_2d(d_norm_factors, n_seg_frames, n_batches, "dump/norm_factors_2.dump");\n\n n_blocks = (n_tot_frames * n_seqs + n_threads - 1) / n_threads;\n compute_posterior_weights<<<n_blocks, n_threads>>>(n_tot_frames, n_seg_frames, n_seqs, n_batches, d_state_buffer,\n d_start_states, d_batch_idxs, d_index, d_norm_factors, d_posterior_weights);\n HANDLE_LAST_ERROR();\n\n n_fill_blocks = (n_batches * n_seg_frames * n_emissions + n_threads - 1u) / n_threads;\n fill_array<<<n_fill_blocks, n_threads>>>(d_out, std::numeric_limits<float>::infinity(), n_batches * n_seg_frames * n_emissions);\n HANDLE_LAST_ERROR();\n\n n_blocks = (n_tot_frames * n_seg_frames * n_edges + n_threads - 1) / n_threads;\n compute_targets<<<n_blocks, n_threads>>>(n_tot_frames, n_seg_frames, n_edges, n_batches, n_seqs, n_emissions,\n d_sequence_idxs, d_emission_idxs, d_batch_idxs, d_index, d_edge_buffer, d_norm_factors, d_out);\n HANDLE_LAST_ERROR();\n\n //dump_to_file_1d(d_weights, n_edges, "dump/edge_weights.dump");\n //dump_to_file_1d(d_sequence_idxs, n_edges, "dump/sequence_idxs.dump");\n //dump_to_file_2d(d_state_buffer, n_ringbuffer_frames, n_states, "dump/state_buffer.dump");\n //dump_to_file_2d(d_batch_idxs, n_tot_frames, n_seqs, 
"dump/batch_idxs.dump");\n //dump_to_file_2d(d_index, n_seg_frames, n_batches, "dump/index.dump");\n //dump_to_file_3d(d_edge_buffer, n_tot_frames, n_seg_frames, n_edges, "dump/edges.dump");\n //dump_to_file_3d(d_am_scores, n_seg_frames, n_batches, n_emissions, "dump/am_scores.dump");\n //dump_to_file_3d(d_out, n_seg_frames, n_batches, n_emissions, "dump/targets.dump");\n\n if (dump_targets and iter %% dump_targets_interval == 0) {\n std::stringstream ss;\n ss << "dump/targets_" << iter << ".dump";\n dump_to_file_3d(d_out, n_seg_frames, n_batches, n_emissions, ss.str());\n ss.str("");\n ss.clear();\n ss << "dump/norm_factors_" << iter << ".dump";\n dump_to_file_2d(d_norm_factors, n_seg_frames, n_batches, ss.str());\n ss.str("");\n ss.clear();\n ss << "dump/posterior_weights_" << iter << ".dump";\n dump_to_file_2d(d_posterior_weights, n_seg_frames, n_batches, ss.str());\n }\n\n iter += 1;\n\n device_free(d_state_buffer);\n device_free(d_edge_buffer);\n '[source]