Skip to content

Numba-dpex generating incorrect results for sliding window matmul kernel #892

@diptorupd

Description

@diptorupd

Here's an example that I can't get to work: it's basically a port of this numba.cuda example "fast matmul" where threads load cooperatively two arrays of shared memories.

import sklearn_numba_dpex
import numba_dpex as dpex
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex.kernel
def matmul(
    X,         # IN READ-ONLY    (X_n_rows, n_cols)
    y,         # IN READ-ONLY    (n_cols, y_n_rows),
    result,    # OUT             (X_n_rows, y_n_rows)
):

    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = dpex.get_global_id(0)
    result_col_idx = dpex.get_global_id(1)

    local_row_idx = dpex.get_local_id(0)
    local_col_idx = dpex.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)
    Y_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)

    output = dtype(0)

    for block_idx in range(n_blocks_for_cols):
        X_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        Y_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_sliding_window[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_sliding_window[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

        for idx in range(square_block_side):
            output += (
                X_sliding_window[local_row_idx, idx]
                * Y_sliding_window[idx, local_col_idx]
            )

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)

print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)

device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)

matmul[(6,6), (2,2)](X, Y, result)

print(result)

Output"

# expected output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]

# kernel output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [ 700.  766.  832.  898.  964.]]

I've tried many variations of it with no success, the last row of the output always has wrong values. Note that this seems to be deterministic: the values in the last row are always the same. Maybe there's an error in my snippet but I've questioned each row of it already and tried to inspect each of it, enough to start thinking the compiled code might be wrong instead even if the kernel is written right. WDYT ?

Originally posted by @fcharras in #871 (comment)

Metadata

Metadata

Assignees

Labels

bugSomething isn't workinguserUser submitted issue

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions