.. include:: ./../ext_links.txt

Compiling and Offloading ``dpnp`` statements
============================================

Data Parallel Extension for NumPy* (``dpnp``) is a drop-in ``NumPy*``
replacement library built on top of oneMKL and SYCL. ``numba-dpex`` allows
various ``dpnp`` library function calls to be JIT-compiled using the
``numba_dpex.dpjit`` decorator. Presently, ``numba-dpex`` can compile several
``dpnp`` array constructors (``ones``, ``zeros``, ``full``, ``empty``), most
universal functions, ``prange`` loops, and vector expressions using
``dpnp.ndarray`` objects.

An example of a supported usage of ``dpnp`` statements in ``numba-dpex`` is
provided in the following code snippet:

.. ``numba-dpex`` implements its own runtime library to support offloading ``dpnp``
.. library functions to SYCL devices. For each ``dpnp`` function signature to be
.. offloaded, ``numba-dpex`` implements the corresponding direct SYCL function call
.. in the runtime and the function call is inlined in the generated LLVM IR.

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit


    @dpjit
    def foo():
        a = dpnp.ones(1024, device="gpu")
        return dpnp.sqrt(a)


    a = foo()
    print(a)
    print(type(a))

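The list of supported constructs above also includes vector expressions over
``dpnp.ndarray`` objects. A minimal sketch of such an expression compiled with
``dpjit`` is shown below; the function name ``vec_expr`` and the array sizes
and dtypes are illustrative assumptions, not part of the library API.

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit


    @dpjit
    def vec_expr(x, y):
        # The arithmetic expression over the dpnp.ndarray operands is
        # compiled as a data-parallel operation by dpjit.
        return x * x + y


    x = dpnp.arange(1024, dtype=dpnp.float32)
    y = dpnp.ones(1024, dtype=dpnp.float32)
    print(vec_expr(x, y))
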
.. :samp:`dpnp.ones(10)` will be called through |ol_dpnp_ones(...)|_.


.. Design
.. -------

.. ``numba_dpex`` uses the |numba.extending.overload| decorator to create a Numba*
.. implementation of a function that can be used in `nopython mode`_ functions.
.. This is done through translation of ``dpnp`` function signatures so that they can
.. be called in ``numba_dpex.dpjit`` decorated code.

.. The specific SYCL operation for a certain ``dpnp`` function is performed by the
.. runtime interface. During compilation of a function decorated with the ``@dpjit``
.. decorator, ``numba-dpex`` generates the corresponding SYCL function call through
.. its runtime library and injects it into the LLVM IR through
.. |numba.extending.intrinsic|_. The ``@intrinsic`` decorator is used for marking a
.. ``dpnp`` function as typing and implementing the function in nopython mode using
.. the `llvmlite IRBuilder API`_. This is an escape hatch to build custom LLVM IR
.. that will be inlined into the caller.

.. The code injection logic to enable ``dpnp`` function calls in the Numba IR is
.. implemented by the :mod:`numba_dpex.core.dpnp_iface.arrayobj` module, which replaces
.. Numba*'s :mod:`numba.np.arrayobj`. Each ``dpnp`` function signature is provided
.. with a concrete implementation that generates the actual code using Numba's
.. ``overload`` function API, e.g.:

.. .. code-block:: python

..     @overload(dpnp.ones, prefer_literal=True)
..     def ol_dpnp_ones(
..         shape, dtype=None, order="C", device=None, usm_type="device", sycl_queue=None
..     ):
..         ...

.. The corresponding intrinsic implementation is in :file:`numba_dpex/dpnp_iface/_intrinsic.py`.

.. .. code-block:: python

..     @intrinsic
..     def impl_dpnp_ones(
..         ty_context,
..         ty_shape,
..         ty_dtype,
..         ty_order,
..         ty_device,
..         ty_usm_type,
..         ty_sycl_queue,
..         ty_retty_ref,
..     ):
..         ...

Parallel Range
---------------

``numba-dpex`` supports using ``numba.prange`` statements with
``dpnp.ndarray`` objects. All such ``prange`` loops are offloaded as kernels and
executed on a device inferred using the compute follows data programming model.
The following example shows the use of a ``prange`` loop.

.. ``numba-dpex`` implements the ability to run loops in parallel. The language
.. construct is adapted from Numba*'s ``prange`` concept that was initially
.. designed to run OpenMP parallel for loops. In Numba*, the loop body is scheduled
.. in separate threads, and they execute in a ``nopython`` Numba* context.
.. ``prange`` automatically takes care of data privatization. ``numba-dpex``
.. employs the ``prange`` compilation mechanism to offload parallel loop-like
.. programming constructs onto SYCL-enabled devices.

.. The ``prange`` compilation pass is delegated through Numba's
.. :file:`numba/parfor/parfor_lowering.py` module where ``numba-dpex`` provides
.. the :file:`numba_dpex/core/parfors/parfor_lowerer.py` module to be used as the
.. *lowering* mechanism through the
.. :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl` class. This
.. provides a custom lowerer for ``prange`` nodes that generates a SYCL kernel for
.. a ``prange`` node and submits it to a queue. Here is an example of a ``prange``
.. use case in a ``@dpjit`` context:

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def foo():
        x = dpnp.ones(1024, device="gpu")
        o = dpnp.empty_like(x)
        for i in prange(x.shape[0]):
            o[i] = x[i] * x[i]
        return o


    c = foo()
    print(c)
    print(type(c))

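In the snippet above the device is requested explicitly through the
``device="gpu"`` argument. When a ``prange`` loop reads ``dpnp.ndarray``
arguments instead, the execution device is inferred from those arguments
following compute follows data. The sketch below assumes this behavior; the
helper ``twice`` and the queue setup are illustrative, not part of the
``numba-dpex`` API.

.. code-block:: python

    import dpctl
    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def twice(x):
        o = dpnp.empty_like(x)
        for i in prange(x.shape[0]):
            o[i] = 2 * x[i]
        return o


    # Compute follows data: the kernel is expected to execute on the device
    # backing the queue that owns ``x``.
    q = dpctl.SyclQueue()
    x = dpnp.arange(1024, sycl_queue=q)
    o = twice(x)
    print(o.sycl_device)
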
.. Each ``prange`` instruction in Numba* has an optional *lowerer* attribute. The
.. lowerer attribute determines how the parfor instruction should be lowered to
.. LLVM IR. In addition, the lowerer attribute decides which ``prange`` instructions
.. can be fused together. At this point ``numba-dpex`` does not generate
.. device-specific code and the same lowerer is used for all device types. However,
.. a different :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl`
.. instance is returned for every ``prange`` instruction for each corresponding CFD
.. (Compute Follows Data) inferred device to prevent illegal ``prange`` fusion.

``prange`` loop statements can also be used to write reduction loops, as
demonstrated by the following naive pairwise distance computation.

.. code-block:: python

    from numba_dpex import dpjit, prange
    import dpnp
    import dpctl


    @dpjit
    def pairwise_distance(X1, X2, D):
        """Naïve pairwise distance computation: take two arrays representing
        M points in N dimensions and fill D with the M x M matrix of
        Euclidean distances.

        Args:
            X1 : Set of points
            X2 : Set of points
            D : Output distance matrix
        """
        # Size of inputs
        X1_rows = X1.shape[0]
        X2_rows = X2.shape[0]
        X1_cols = X1.shape[1]

        float0 = X1.dtype.type(0.0)

        # Outermost parallel loop over the matrix X1
        for i in prange(X1_rows):
            # Loop over the matrix X2
            for j in range(X2_rows):
                d = float0
                # Compute the Euclidean distance
                for k in range(X1_cols):
                    tmp = X1[i, k] - X2[j, k]
                    d += tmp * tmp
                # Write the computed distance to the distance matrix
                D[i, j] = dpnp.sqrt(d)


    q = dpctl.SyclQueue()
    X1 = dpnp.ones((10, 2), sycl_queue=q)
    X2 = dpnp.zeros((10, 2), sycl_queue=q)
    D = dpnp.empty((10, 10), sycl_queue=q)

    pairwise_distance(X1, X2, D)
    print(D)

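In the pairwise distance example the accumulation into ``d`` is private to
each ``(i, j)`` pair. A whole-array reduction can also be expressed by
accumulating into a scalar across ``prange`` iterations. The following is a
minimal sketch, assuming scalar sum reductions inside ``prange`` are handled
by ``dpjit`` the same way stock Numba* handles them; the function name
``total`` is illustrative.

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def total(a):
        # Accumulating into a scalar across prange iterations expresses a
        # reduction over the whole array.
        s = 0.0
        for i in prange(a.shape[0]):
            s += a[i]
        return s


    a = dpnp.ones(1024)
    print(total(a))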

.. Fusion of Kernels
.. ------------------

.. ``numba-dpex`` can identify each NumPy* (or ``dpnp``) array expression as a
.. data-parallel kernel and fuse them together to generate a single SYCL kernel.
.. The kernel is automatically offloaded to the specified device where the fusion
.. operation is invoked. Here is a simple example of a Black-Scholes formula
.. computation where kernel fusion occurs at different ``dpnp`` math functions:

.. .. literalinclude:: ./../../../numba_dpex/examples/blacksholes_njit.py
..     :language: python
..     :pyobject: blackscholes
..     :caption: **EXAMPLE:** Data parallel kernel implementing the vector sum a+b
..     :name: blackscholes_dpjit


.. .. |numba.extending.overload| replace:: ``numba.extending.overload``
.. .. |numba.extending.intrinsic| replace:: ``numba.extending.intrinsic``
.. .. |ol_dpnp_ones(...)| replace:: ``ol_dpnp_ones(...)``
.. .. |numba.np.arrayobj| replace:: ``numba.np.arrayobj``

.. .. _low-level API: https://github.com/IntelPython/dpnp/tree/master/dpnp/backend
.. .. _`ol_dpnp_ones(...)`: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/dpnp_iface/arrayobj.py#L358
.. .. _`numba.extending.overload`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-functions
.. .. _`numba.extending.intrinsic`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-intrinsics
.. .. _nopython mode: https://numba.pydata.org/numba-doc/latest/glossary.html#term-nopython-mode
.. .. _`numba.np.arrayobj`: https://github.com/numba/numba/blob/main/numba/np/arrayobj.py
.. .. _`llvmlite IRBuilder API`: http://llvmlite.pydata.org/en/latest/user-guide/ir/ir-builder.html