@@ -6,12 +6,15 @@ Overview

Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source
LLVM-based code generator for portable accelerator programming in Python.
- numba_dpex defines a new kernel programming domain-specific language (DSL)
- in pure Python called `KAPI` that is modeled after the C++ embedded DSL
- `SYCL*`_.
+ numba_dpex defines a new kernel programming domain-specific language (DSL) in
+ pure Python called `KAPI` that is modeled after the C++ embedded DSL `SYCL*`_. A
+ KAPI function can be JIT compiled by numba-dpex to generate a "data-parallel"
+ kernel function that executes in parallel on a supported device. Currently,
+ compilation of KAPI is possible for x86 CPU devices (using OpenCL CPU drivers),
+ Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.

- The following example illustrates a relatively simple pairwise distance matrix
- computation example written in KAPI.
+ The following example uses KAPI to code a pairwise distance computation.

.. code-block:: python

@@ -35,158 +38,59 @@ computation example written in KAPI.


    data = np.random.ranf((10000, 3)).astype(np.float32)
-     distance = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
+     dist = np.empty(shape=(data.shape[0], data.shape[0]), dtype=np.float32)
    exec_range = kapi.Range(data.shape[0], data.shape[0])
-     kapi.call_kernel(pairwise_distance_kernel, exec_range, data, distance)
-
- Skipping over much of the language details, at a high-level the
- ``pairwise_distance_kernel`` can be viewed as a "data-parallel" function that
- gets executed individually by a set of "work items". That is, each work item
- runs the same function for a subset of the elements of the input ``data`` and
- ``distance`` arrays. For programmers familiar with the CUDA or OpenCL languages,
- it is the same programming model referred to as Single Program Multiple Data
- (SPMD). As Python has no concept of a work item the KAPI function runs
- sequentially resulting in a very slow execution time. Experienced Python
- programmers will most probably write a much faster version of the function using
- NumPy*.
-
- However, using a JIT compiler numba-dpex can compile a function written in the
- KAPI language to a CPython native extension function that executes according to
- the SPMD programming model, speeding up the execution time by orders of
- magnitude. Currently, compilation of KAPI is possible for x86 CPU devices,
- Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel discrete GPUs.
-
-
- ``numba-dpex`` is an open-source project and can be installed as part of `Intel
- AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The package is
- also available on Anaconda cloud and as a Docker image on GitHub. Please refer
- the :doc:`getting_started` page to learn more.
-
- Main Features
- -------------
-
- Portable Kernel Programming
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- The ``numba-dpex`` kernel programming API has a design similar to Numba's
- ``cuda.jit`` sub-module. The API is modeled after the `SYCL*`_ language and uses
- the `DPC++`_ SYCL runtime. Currently, compilation of kernels is supported for
- SPIR-V-based OpenCL and `oneAPI Level Zero`_ devices CPU and GPU devices. In the
- future, compilation support for other types of hardware that are supported by
- DPC++ will be added.
-
- The following example illustrates a vector addition kernel written with
- ``numba-dpex`` kernel API.
+     kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
+
+ The ``pairwise_distance_kernel`` function is conceptually a "data-parallel"
+ function that gets executed individually by a set of "work items". That is,
+ each work item runs the same function for a subset of the elements of the
+ input ``data`` and ``distance`` arrays. For programmers familiar with the CUDA
+ or OpenCL languages, it is the programming model referred to as Single Program
+ Multiple Data (SPMD). Although a KAPI function conceptually follows the SPMD
+ model, Python has no concept of a work item, so a KAPI function runs
+ sequentially and needs to be JIT compiled for parallel execution.
+
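+ As an illustration (not an API that numba-dpex itself provides), the work done
+ collectively by the work items is equivalent to the following explicit loop
+ nest, where each iteration of the two outer loops plays the role of one work
+ item:
+
+ .. code-block:: python
+
+     import math
+
+     # Sequential sketch of the SPMD model: each (i, j) pair corresponds to
+     # one work item executing the same kernel body.
+     def pairwise_distance_sequential(data, distance):
+         for i in range(data.shape[0]):
+             for j in range(data.shape[0]):
+                 d = 0.0
+                 for k in range(data.shape[1]):
+                     tmp = data[i, k] - data[j, k]
+                     d += tmp * tmp
+                 distance[j, i] = math.sqrt(d)
+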
+ JIT compiling a KAPI function only requires adding the ``dpex.kernel``
+ decorator to the function and launching it via the ``dpex.call_kernel``
+ function. Note that a JIT compiled KAPI function does not accept NumPy arrays;
+ it can only be called with ``dpnp.ndarray`` or ``dpctl.tensor.usm_ndarray``
+ array objects. The restriction exists because a compiled KAPI function
+ requires memory that is allocated on the device where the kernel executes.
+ Refer to the :doc:`programming_model` page and the kernel programming user
+ guide for further details. The modifications to the
+ ``pairwise_distance_kernel`` function needed for JIT compilation are shown in
+ the next example.

.. code-block:: python

-     import dpnp
+     from numba_dpex import kernel_api as kapi
    import numba_dpex as dpex
-
-
-     @dpex.kernel
-     def vecadd_kernel(a, b, c):
-         i = dpex.get_global_id(0)
-         c[i] = a[i] + b[i]
-
-
-     a = dpnp.ones(1024, device="gpu")
-     b = dpnp.ones(1024, device="gpu")
-     c = dpnp.empty_like(a)
-
-     vecadd_kernel[dpex.Range(1024)](a, b, c)
-     print(c)
-
- In the above example, three arrays are allocated on a default ``gpu`` device
- using the ``dpnp`` library. The arrays are then passed as input arguments to the
- kernel function. The compilation target and the subsequent execution of the
- kernel is determined by the input arguments and follow the
- "compute-follows-data" programming model as specified in the `Python* Array API
- Standard`_. To change the execution target to a CPU, the device keyword needs to
- be changed to ``cpu`` when allocating the ``dpnp`` arrays. It is also possible
- to leave the ``device`` keyword undefined and let the ``dpnp`` library select a
- default device based on environment flag settings. Refer the
- :doc:`user_guide/kernel_programming/index` for further details.
-
- ``dpjit`` decorator
- ~~~~~~~~~~~~~~~~~~~
-
- The ``numba-dpex`` package provides a new decorator ``dpjit`` that extends
- Numba's ``njit`` decorator. The new decorator is equivalent to
- ``numba.njit(parallel=True)``, but additionally supports compiling ``dpnp``
- functions, ``prange`` loops, and array expressions that use ``dpnp.ndarray``
- objects.
-
- Unlike Numba's NumPy parallelization that only supports CPUs, ``dpnp``
- expressions are first converted to data-parallel kernels and can then be
- `offloaded` to different types of devices. As ``dpnp`` implements the same API
- as NumPy*, an existing ``numba.njit`` decorated function that uses
- ``numpy.ndarray`` may be refactored to use ``dpnp.ndarray`` and decorated with
- ``dpjit``. Such a refactoring can allow the parallel regions to be offloaded
- to a supported GPU device, providing users an additional option to execute their
- code parallelly.
-
- The vector addition example depicted using the kernel API can also be
- expressed in several different ways using ``dpjit``.
-
- .. code-block:: python
-
+     import math
    import dpnp
-     import numba_dpex as dpex
-
-
-     @dpex.dpjit
-     def vecadd_v1(a, b):
-         return a + b


-     @dpex.dpjit
-     def vecadd_v2(a, b):
-         return dpnp.add(a, b)
-
-
-     @dpex.dpjit
-     def vecadd_v3(a, b):
-         c = dpnp.empty_like(a)
-         for i in prange(a.shape[0]):
-             c[i] = a[i] + b[i]
-         return c
-
- As with the kernel API example, a ``dpjit`` function if invoked with ``dpnp``
- input arguments follows the compute-follows-data programming model. Refer
- :doc:`user_manual/dpnp_offload/index` for further details.
-
-
- .. Project Goal
- .. ------------
-
- .. If C++ is not your language, you can skip writing data-parallel kernels in SYCL
- .. and directly write them in Python.
-
- .. Our package ``numba-dpex`` extends the Numba compiler to allow kernel creation
- .. directly in Python via a custom compute API
-
+     @dpex.kernel
+     def pairwise_distance_kernel(item: kapi.Item, data, distance):
+         i = item.get_id(0)
+         j = item.get_id(1)

- .. Contributing
- .. ------------
+         data_dims = data.shape[1]

- .. Refer the `contributing guide
- .. <https://github.com/IntelPython/numba-dpex/blob/main/CONTRIBUTING>`_ for
- .. information on coding style and standards used in ``numba-dpex``.
+         d = data.dtype.type(0.0)
+         for k in range(data_dims):
+             tmp = data[i, k] - data[j, k]
+             d += tmp * tmp

- .. License
- .. -------
+         distance[j, i] = math.sqrt(d)

- .. ``numba-dpex`` is Licensed under Apache License 2.0 that can be found in `LICENSE
- .. <https://github.com/IntelPython/numba-dpex/blob/main/LICENSE>`_. All usage and
- .. contributions to the project are subject to the terms and conditions of this
- .. license.

+     data = dpnp.random.ranf((10000, 3)).astype(dpnp.float32)
+     dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), dtype=dpnp.float32)
+     exec_range = kapi.Range(data.shape[0], data.shape[0])
+     dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist)

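+ The device on which the compiled kernel executes is determined by where the
+ input arrays were allocated. A minimal sketch, assuming a machine with OpenCL
+ CPU drivers installed, of running the same kernel on a CPU by moving the
+ arrays there first:
+
+ .. code-block:: python
+
+     # Sketch only: dpnp's ``device`` keyword controls where the arrays are
+     # allocated, and the compiled kernel then executes on that device.
+     data_cpu = dpnp.asarray(data, device="cpu")
+     dist_cpu = dpnp.asarray(dist, device="cpu")
+     dpex.call_kernel(pairwise_distance_kernel, exec_range, data_cpu, dist_cpu)
+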
- .. Along with the kernel programming API an auto-offload feature is also provided.
- .. The feature enables automatic generation of kernels from data-parallel NumPy
- .. library calls and array expressions, Numba ``prange`` loops, and `other
- .. "data-parallel by construction" expressions
- .. <https://numba.pydata.org/numba-doc/latest/user/parallel.html>`_ that Numba is
- .. able to parallelize. Following two examples demonstrate the two ways in which
- .. kernels may be written using numba-dpex.
+ ``numba-dpex`` is an open-source project and can be installed as part of the
+ `Intel AI Analytics Toolkit`_ or the `Intel Distribution for Python*`_. The
+ package is also available on Anaconda Cloud, PyPI, and as a Docker image on
+ GitHub. Refer to the :doc:`getting_started` page for further details.
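+
+ As a quick illustration, assuming the published package name, an install from
+ the command line might look like the following (see :doc:`getting_started`
+ for the recommended channels and versions):
+
+ .. code-block:: bash
+
+     # one of the following, depending on your package manager
+     pip install numba-dpex
+     conda install numba-dpex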