
Commit 884423a

Update tuning_guide.py
1 parent b978140 commit 884423a

File tree

1 file changed: 9 additions, 9 deletions

recipes_source/recipes/tuning_guide.py

Lines changed: 9 additions & 9 deletions
@@ -189,7 +189,7 @@ def fused_gelu(x):
 #
 # In general cases, the following command executes a PyTorch script only on the cores of the Nth node and avoids cross-socket memory access, reducing memory access overhead.
 
-``numactl --cpunodebind=N --membind=N python <pytorch_script>``
+# numactl --cpunodebind=N --membind=N python <pytorch_script>
 
 ###############################################################################
 # More detailed descriptions can be found `here <https://software.intel.com/content/www/us/en/develop/articles/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html>`_.
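As a small sketch of what this binding does (an illustration, not part of the commit; it assumes a Linux host, since `os.sched_getaffinity` is Linux-only), the snippet below inspects which CPUs the current process may run on. When the script is launched under `numactl --cpunodebind=N`, this set is restricted to node N's cores:

```python
import os

# On Linux, sched_getaffinity reports the set of CPU ids this process
# may be scheduled on. Launched under
# ``numactl --cpunodebind=N --membind=N python <pytorch_script>``,
# this set shrinks to the cores of NUMA node N only.
allowed_cpus = os.sched_getaffinity(0)
print(f"process may run on {len(allowed_cpus)} CPUs: {sorted(allowed_cpus)}")
```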
@@ -204,37 +204,37 @@ def fused_gelu(x):
 ###############################################################################
 # With the following command, PyTorch runs the task on N OpenMP threads.
 
-``export OMP_NUM_THREADS=N``
+# export OMP_NUM_THREADS=N
 
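A minimal sketch of choosing N programmatically from inside the launcher. The one-thread-per-physical-core heuristic and the 2-way-SMT assumption are mine, not the tutorial's; the key point is that the variable must be set before the OpenMP runtime starts, i.e. before `import torch` in this process:

```python
import os

# A common heuristic (an assumption, not from the tutorial): one OpenMP
# thread per physical core, assuming 2-way SMT/hyper-threading.
num_threads = max(1, (os.cpu_count() or 1) // 2)

# Must happen before ``import torch`` so the OpenMP runtime sees it.
os.environ["OMP_NUM_THREADS"] = str(num_threads)
print(os.environ["OMP_NUM_THREADS"])
```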
 ###############################################################################
 # Typically, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. OMP_PROC_BIND specifies whether threads may be moved between processors; setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions. OMP_SCHEDULE determines how OpenMP threads are scheduled. GOMP_CPU_AFFINITY binds threads to specific CPUs.
 
-``export OMP_SCHEDULE=STATIC``
-``export OMP_PROC_BIND=CLOSE``
-``export GOMP_CPU_AFFINITY="N-M"``
+# export OMP_SCHEDULE=STATIC
+# export OMP_PROC_BIND=CLOSE
+# export GOMP_CPU_AFFINITY="N-M"
 
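The same three settings can be applied from a Python launcher before child processes start. This is a hedged sketch: the helper name `gomp_affinity_range` and the 0-3 range are mine; `STATIC` and `CLOSE` come from the commands above:

```python
import os

def gomp_affinity_range(first_cpu: int, last_cpu: int) -> str:
    """Build the "N-M" range string expected by GOMP_CPU_AFFINITY."""
    if first_cpu > last_cpu:
        raise ValueError("first_cpu must not exceed last_cpu")
    return f"{first_cpu}-{last_cpu}"

# STATIC/CLOSE mirror the exported values above; pinning to CPUs 0-3
# is an illustrative choice for a 4-core example.
os.environ["OMP_SCHEDULE"] = "STATIC"
os.environ["OMP_PROC_BIND"] = "CLOSE"
os.environ["GOMP_CPU_AFFINITY"] = gomp_affinity_range(0, 3)
```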
 ###############################################################################
 # Intel OpenMP Runtime Library (libiomp)
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 # By default, PyTorch uses GNU OpenMP (GNU libgomp) for parallel computation. On Intel platforms, the Intel OpenMP Runtime Library (libiomp) provides OpenMP API specification support. It sometimes brings more performance benefits compared to libgomp. The environment variable LD_PRELOAD can be used to switch the OpenMP library to libiomp:
 
-``export LD_PRELOAD=<path>/libiomp5.so:$LD_PRELOAD``
+# export LD_PRELOAD=<path>/libiomp5.so:$LD_PRELOAD
 
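A sketch of doing the same prepend from a launcher script. The helper name and the install path are assumptions (the tutorial leaves the path as `<path>`); note that `LD_PRELOAD` is read at process startup, so setting it in a running Python process only affects child processes launched afterwards:

```python
import os

def prepend_ld_preload(library_path: str) -> None:
    """Prepend a shared library to LD_PRELOAD, keeping any existing entries."""
    existing = os.environ.get("LD_PRELOAD", "")
    os.environ["LD_PRELOAD"] = (
        library_path if not existing else f"{library_path}:{existing}"
    )

# Hypothetical install location; substitute your actual libiomp5.so path.
prepend_ld_preload("/opt/intel/lib/libiomp5.so")
```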
 ###############################################################################
 # Similar to the CPU affinity settings in GNU OpenMP, libiomp provides environment variables to control CPU affinity.
 # KMP_AFFINITY binds OpenMP threads to physical processing units. KMP_BLOCKTIME sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping. In most cases, setting KMP_BLOCKTIME to 1 or 0 yields good performance.
 # The following commands show common settings with the Intel OpenMP Runtime Library.
 
-``export KMP_AFFINITY=granularity=fine,compact,1,0``
-``export KMP_BLOCKTIME=1``
+# export KMP_AFFINITY=granularity=fine,compact,1,0
+# export KMP_BLOCKTIME=1
 
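Because these variables are read when the OpenMP runtime initializes, a launcher would typically pass them to the child process's environment. A minimal sketch (the subprocess pattern is mine; the two values come from the commands above):

```python
import os
import subprocess
import sys

# The KMP_* values mirror the exported settings above; they are placed
# in the child's environment so its OpenMP runtime sees them at startup.
kmp_env = dict(
    os.environ,
    KMP_AFFINITY="granularity=fine,compact,1,0",
    KMP_BLOCKTIME="1",
)
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"],
    env=kmp_env,
    capture_output=True,
    text=True,
    check=True,
)
print(out.stdout.strip())  # the child process sees KMP_BLOCKTIME=1
```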
 ###############################################################################
 # Switch Memory allocator
 # ~~~~~~~~~~~~~~~~~~~~~~~
 # For deep learning workloads, Jemalloc or TCMalloc can deliver better performance than the default malloc function by reusing memory as much as possible. `Jemalloc <https://github.com/jemalloc/jemalloc>`_ is a general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. `TCMalloc <https://google.github.io/tcmalloc/overview.html>`_ also features a couple of optimizations to speed up program executions. One of them is holding memory in caches to speed up access to commonly-used objects. Holding such caches even after deallocation also helps avoid costly system calls if such memory is later re-allocated.
 # Use the environment variable LD_PRELOAD to take advantage of one of them.
 
-``export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD``
+# export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
 
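Finding the actual `.so` path to preload varies by system; one way, sketched here as an assumption rather than the tutorial's method, is the standard-library `ctypes.util.find_library`, which returns None when the allocator is not installed:

```python
import ctypes.util

# Probe for an alternative allocator on this machine; find_library
# returns the resolved library name, or None if it is not installed.
found = None
for name in ("tcmalloc", "jemalloc"):
    candidate = ctypes.util.find_library(name)
    if candidate is not None:
        found = candidate
        print(f"found {name}: preload it via LD_PRELOAD={found}")
        break
else:
    print("neither tcmalloc nor jemalloc found; install one to try this")
```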
 ###############################################################################
 # Train a model on CPU with PyTorch DistributedDataParallel(DDP) functionality
