Here's an example that I can't get to work: it's essentially a port of the numba.cuda "fast matmul" example, in which the threads of a work group cooperatively load tiles of the two input arrays into shared (local) memory.
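For reference, the CUDA original looks roughly like this (paraphrased from memory from the numba docs, so details may differ; unlike my port below, it has no bounds checks and assumes matrix dimensions that are exact multiples of `TPB`):

```python
from numba import cuda, float32

TPB = 16  # threads per block along each axis


@cuda.jit
def fast_matmul(A, B, C):
    # One shared-memory tile per input matrix.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    tmp = float32(0.0)
    for i in range(cuda.gridDim.x):
        # Cooperatively load one tile of A and one tile of B.
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()
        # Partial dot product on the tiles in shared memory.
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()
    C[x, y] = tmp
```

My dpex port adds zero-padding and bounds checks so it also works on sizes that aren't multiples of the block side: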
```python
import sklearn_numba_dpex
import numba_dpex as dpex
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex.kernel
def matmul(
    X,       # IN READ-ONLY  (X_n_rows, n_cols)
    y,       # IN READ-ONLY  (n_cols, y_n_rows)
    result,  # OUT           (X_n_rows, y_n_rows)
):
    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = dpex.get_global_id(0)
    result_col_idx = dpex.get_global_id(1)

    local_row_idx = dpex.get_local_id(0)
    local_col_idx = dpex.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)
    Y_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)

    output = dtype(0)

    for block_idx in range(n_blocks_for_cols):
        # Cooperatively load the current block of X and y into local
        # memory, zero-padding out-of-bounds items.
        X_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        Y_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_sliding_window[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]
        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_sliding_window[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]
        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

        # Accumulate the partial dot product for this block.
        for idx in range(square_block_side):
            output += (
                X_sliding_window[local_row_idx, idx]
                * Y_sliding_window[idx, local_col_idx]
            )
        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)
print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)
device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)
matmul[(6, 6), (2, 2)](X, Y, result)
print(result)
```
Output:

```
# expected output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]

# kernel output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [ 700.  766.  832.  898.  964.]]
```
I've tried many variations of it with no success: the last row of the output always has wrong values. Note that the failure seems to be deterministic: the values in the last row are always the same. Maybe there's an error in my snippet, but I've already questioned and inspected each line of it, enough to start thinking that the compiled code might be wrong even though the kernel itself is written correctly. WDYT?
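To rule out an algorithmic mistake, here is a pure NumPy emulation of the same blocked accumulation (just a sanity-check sketch I used, not part of the reproducer; `blocked_matmul` is my own helper name):

```python
import numpy as np


def blocked_matmul(X, Y, block=2):
    """Accumulate X @ Y block by block, mirroring the kernel's loop
    over column blocks of X / row blocks of Y."""
    n_rows, n_cols = X.shape
    n_out_cols = Y.shape[1]
    result = np.zeros((n_rows, n_out_cols), dtype=X.dtype)
    n_blocks = (n_cols + block - 1) // block
    for b in range(n_blocks):
        lo, hi = b * block, min((b + 1) * block, n_cols)
        # Same partial sums the kernel accumulates for block b.
        result += X[:, lo:hi] @ Y[lo:hi, :]
    return result


X = np.arange(25, dtype=np.float32).reshape(5, 5)
Y = np.arange(25, dtype=np.float32).reshape(5, 5)
assert np.allclose(blocked_matmul(X, Y), X @ Y)
```

The emulation matches `np.matmul` on these inputs, so the blocked accumulation itself looks sound; the discrepancy only shows up in the compiled kernel.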
Originally posted by @fcharras in #871 (comment)