|
7 | 7 | What is Channels Last |
8 | 8 | --------------------- |
9 | 9 |
|
10 | | -Channels Last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels Last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel). |
| 10 | +Channels last memory format is an alternative way of ordering NCHW tensors in memory, preserving the dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel). |
11 | 11 |
|
12 | 12 | For example, classic (contiguous) storage of an NCHW tensor (in our case it is two 2x2 images with 3 color channels) looks like this: |
13 | 13 |
|
14 | 14 | .. figure:: /_static/img/classic_memory_format.png |
15 | 15 | :alt: classic_memory_format |
16 | 16 |
|
17 | | -Channels Last memory format orders data differently: |
| 17 | +Channels last memory format orders data differently: |
18 | 18 |
|
19 | 19 | .. figure:: /_static/img/channels_last_memory_format.png |
20 | 20 | :alt: channels_last_memory_format |
21 | 21 |
|
22 | 22 | PyTorch supports memory formats (and provides backward compatibility with existing models, including eager, JIT, and TorchScript) by utilizing the existing strides structure. |
23 | | -For example, 10x3x16x16 batch in Channels Last format will have strides equal to (768, 1, 48, 3). |
| 23 | +For example, a 10x3x16x16 batch in channels last format will have strides equal to (768, 1, 48, 3). |
24 | 24 | """ |
25 | 25 |
|
26 | 26 | ###################################################################### |
27 | | -# Channels Last memory format is implemented for 4D NCWH Tensors only. |
| 27 | +# Channels last memory format is implemented for 4D NCHW tensors only. |
28 | 28 | # |
29 | 29 |
|
30 | | -import torch |
31 | | -N, C, H, W = 10, 3, 32, 32 |
32 | | - |
33 | 30 | ###################################################################### |
34 | 31 | # Memory Format API |
35 | 32 | # ----------------------- |
|
39 | 36 |
|
40 | 37 | ###################################################################### |
41 | 38 | # Classic PyTorch contiguous tensor |
| 39 | +import torch |
| 40 | +N, C, H, W = 10, 3, 32, 32 |
42 | 41 | x = torch.empty(N, C, H, W) |
43 | 42 | print(x.stride()) # Outputs: (3072, 1024, 32, 1) |
44 | 43 |
|
45 | 44 | ###################################################################### |
46 | 45 | # Conversion operator |
47 | | -x = x.contiguous(memory_format=torch.channels_last) |
| 46 | +x = x.to(memory_format=torch.channels_last) |
48 | 47 | print(x.shape) # Outputs: (10, 3, 32, 32) as dimensions order is preserved |
49 | 48 | print(x.stride()) # Outputs: (3072, 1, 96, 3) |
50 | 49 |
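| | +###################################################################### |
| | +# A hedged aside (not part of the original tutorial): in channels last |
| | +# format the strides of an ``N x C x H x W`` tensor are |
| | +# ``(H*W*C, 1, W*C, C)``, which is where the ``(3072, 1, 96, 3)`` above |
| | +# and the ``(768, 1, 48, 3)`` quoted earlier for a 10x3x16x16 batch |
| | +# come from. |
| | +y = torch.empty(10, 3, 16, 16).to(memory_format=torch.channels_last) |
| | +print(y.stride())  # Expected: (768, 1, 48, 3) |
| | + |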
|
51 | 50 | ###################################################################### |
52 | 51 | # Back to contiguous |
53 | | -x = x.contiguous(memory_format=torch.contiguous_format) |
| 52 | +x = x.to(memory_format=torch.contiguous_format) |
54 | 53 | print(x.stride()) # Outputs: (3072, 1024, 32, 1) |
55 | 54 |
|
56 | 55 | ###################################################################### |
57 | 56 | # Alternative option |
58 | | -x = x.to(memory_format=torch.channels_last) |
| 57 | +x = x.contiguous(memory_format=torch.channels_last) |
59 | 58 | print(x.stride()) # Outputs: (3072, 1, 96, 3) |
60 | 59 |
|
61 | 60 | ###################################################################### |
62 | 61 | # Format checks |
63 | 62 | print(x.is_contiguous(memory_format=torch.channels_last)) # Outputs: True |
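| | +# A small addition (not in the original tutorial): since ``x`` now |
| | +# carries channels last strides, the default contiguity check is False. |
| | +print(x.is_contiguous())  # Expected: False |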
64 | 63 |
|
65 | 64 | ###################################################################### |
66 | | -# Create as Channels Last |
| 65 | +# There are minor differences between the two APIs ``to`` and |
| 66 | +# ``contiguous``. We suggest sticking with ``to`` when explicitly |
| 67 | +# converting the memory format of a tensor. |
| 68 | +# |
| 69 | +# For general cases the two APIs behave the same. However, in the |
| 70 | +# special case of a 4D tensor with size ``NCHW``, when either ``C==1`` |
| 71 | +# or ``H==1 && W==1``, only ``to`` generates a proper stride to |
| 72 | +# represent the channels last memory format. |
| 73 | +# |
| 74 | +# This is because in either of the two cases above the memory format |
| 75 | +# of the tensor is ambiguous, i.e. a contiguous tensor with size |
| 76 | +# ``N1HW`` is both ``contiguous`` and channels last in memory storage. |
| 77 | +# Therefore, it is already considered ``is_contiguous`` |
| 78 | +# for the given memory format, so a ``contiguous`` call becomes a |
| 79 | +# no-op and does not update the strides. In contrast, ``to`` |
| 80 | +# restrides the tensor with meaningful strides on dimensions whose |
| 81 | +# sizes are 1 in order to properly represent the intended memory |
| 82 | +# format. |
| 83 | +special_x = torch.empty(4, 1, 4, 4) |
| 84 | +print(special_x.is_contiguous(memory_format=torch.channels_last)) # Outputs: True |
| 85 | +print(special_x.is_contiguous(memory_format=torch.contiguous_format)) # Outputs: True |
| 86 | + |
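| | +# A minimal sketch (not from the original tutorial) of the difference |
| | +# described above on this ambiguous ``N1HW`` tensor: ``contiguous`` is |
| | +# a no-op and keeps the contiguous strides, while ``to`` restrides the |
| | +# tensor into channels last form. |
| | +print(special_x.contiguous(memory_format=torch.channels_last).stride())  # Expected: (16, 16, 4, 1) -- unchanged |
| | +print(special_x.to(memory_format=torch.channels_last).stride())  # Expected: (16, 1, 4, 1) -- channels last |
| | + |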
| 87 | +###################################################################### |
| 88 | +# The same thing applies to the explicit permutation API ``permute``. |
| 89 | +# In the special case where ambiguity could occur, ``permute`` is not |
| 90 | +# guaranteed to produce a stride that properly carries the intended |
| 91 | +# memory format. We suggest using ``to`` with an explicit memory |
| 92 | +# format to avoid unintended behavior. |
| 93 | +# |
| 94 | +# As a side note, in the extreme case where all three non-batch |
| 95 | +# dimensions are equal to ``1`` (``C==1 && H==1 && W==1``), the |
| 96 | +# current implementation cannot mark a tensor as channels last memory |
| 97 | +# format. |
| 98 | + |
| 99 | +###################################################################### |
| 100 | +# Create as channels last |
67 | 101 | x = torch.empty(N, C, H, W, memory_format=torch.channels_last) |
68 | 102 | print(x.stride()) # Outputs: (3072, 1, 96, 3) |
69 | 103 |
|
|
89 | 123 | print(z.stride()) # Outputs: (3072, 1, 96, 3) |
90 | 124 |
|
91 | 125 | ###################################################################### |
92 | | -# Conv, Batchnorm modules support Channels Last |
93 | | -# (only works for CudNN >= 7.6) |
| 126 | +# Conv, Batchnorm modules using the cuDNN backend support channels last |
| 127 | +# (only works for cuDNN >= 7.6). Convolution modules, unlike binary |
| 128 | +# p-wise operators, have channels last as the dominating memory format. |
| 129 | +# If all inputs are in contiguous memory format, the operator |
| 130 | +# produces output in contiguous memory format. Otherwise, the output |
| 131 | +# will be in channels last memory format. |
| 132 | + |
94 | 133 | if torch.backends.cudnn.version() >= 7603: |
95 | | - input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, device="cuda", requires_grad=True) |
96 | | - model = torch.nn.Conv2d(8, 4, 3).cuda().float() |
| 134 | + model = torch.nn.Conv2d(8, 4, 3).cuda().half() |
| 135 | + model = model.to(memory_format=torch.channels_last) # Module parameters need to be channels last |
97 | 136 |
|
98 | | - input = input.contiguous(memory_format=torch.channels_last) |
99 | | - model = model.to(memory_format=torch.channels_last) # Module parameters need to be Channels Last |
| 137 | + input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True) |
| 138 | + input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16) |
100 | 139 |
|
101 | 140 | out = model(input) |
102 | 141 | print(out.is_contiguous(memory_format=torch.channels_last)) # Outputs: True |
103 | 142 |
|
| 143 | +###################################################################### |
| 144 | +# When an input tensor reaches an operator without channels last |
| 145 | +# support, a permutation is automatically applied in the kernel to |
| 146 | +# restore the contiguous format on the input tensor. This introduces |
| 147 | +# overhead and stops the channels last memory format propagation. |
| 148 | +# Nevertheless, it guarantees correct output. |
| 149 | + |
104 | 150 | ###################################################################### |
105 | 151 | # Performance Gains |
106 | | -# ------------------------------------------------------------------------------------------- |
107 | | -# The most significant performance gains are observed on Nvidia's hardware with |
108 | | -# Tensor Cores support. We were able to archive over 22% perf gains while running ' |
109 | | -# AMP (Automated Mixed Precision) training scripts supplied by Nvidia https://github.com/NVIDIA/apex. |
| 152 | +# -------------------------------------------------------------------- |
| 153 | +# The most significant performance gains are observed on NVIDIA |
| 154 | +# hardware with Tensor Cores support, running at reduced precision |
| 155 | +# (``torch.float16``). |
| 156 | +# We were able to achieve over 22% perf gains with channels last |
| 157 | +# compared to the contiguous format, while utilizing |
| 158 | +# 'AMP (Automated Mixed Precision)' training scripts. |
| 159 | +# Our scripts use AMP supplied by NVIDIA: |
| 160 | +# https://github.com/NVIDIA/apex. |
110 | 161 | # |
111 | 162 | # ``python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 ./data`` |
112 | 163 |
|
|
143 | 194 | # Epoch: [0][80/125] Time 0.260 (0.335) Speed 770.324 (597.659) Loss 2.2505953312 (1.0879) Prec@1 50.500 (52.938) Prec@5 100.000 (100.000) |
144 | 195 |
|
145 | 196 | ###################################################################### |
146 | | -# Passing ``--channels-last true`` allows running a model in Channels Last format with observed 22% perf gain. |
147 | | -# |
| 197 | +# Passing ``--channels-last true`` allows running a model in the channels last format with an observed 22% performance gain. |
| 198 | +# |
148 | 199 | # ``python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 --channels-last true ./data`` |
149 | 200 |
|
150 | 201 | # opt_level = O2 |
|
184 | 235 | # Epoch: [0][80/125] Time 0.198 (0.269) Speed 1011.827 (743.883) Loss 2.8196096420 (2.4011) Prec@1 47.500 (50.938) Prec@5 100.000 (100.000) |
185 | 236 |
|
186 | 237 | ###################################################################### |
187 | | -# The following list of models has the full support of Channels Last and showing 8%-35% perf gains on Volta devices: |
| 238 | +# The following list of models has full support for channels last and shows 8%-35% perf gains on Volta devices: |
188 | 239 | # ``alexnet``, ``mnasnet0_5``, ``mnasnet0_75``, ``mnasnet1_0``, ``mnasnet1_3``, ``mobilenet_v2``, ``resnet101``, ``resnet152``, ``resnet18``, ``resnet34``, ``resnet50``, ``resnext50_32x4d``, ``shufflenet_v2_x0_5``, ``shufflenet_v2_x1_0``, ``shufflenet_v2_x1_5``, ``shufflenet_v2_x2_0``, ``squeezenet1_0``, ``squeezenet1_1``, ``vgg11``, ``vgg11_bn``, ``vgg13``, ``vgg13_bn``, ``vgg16``, ``vgg16_bn``, ``vgg19``, ``vgg19_bn``, ``wide_resnet101_2``, ``wide_resnet50_2`` |
189 | 240 | # |
190 | 241 |
|
191 | 242 | ###################################################################### |
192 | 243 | # Converting existing models |
193 | 244 | # -------------------------- |
194 | 245 | # |
195 | | -# Channels Last support not limited by existing models, as any model can be converted to Channels Last and propagate format through the graph as soon as input formatted correctly. |
| 246 | +# Channels last support is not limited to existing models, as any |
| 247 | +# model can be converted to channels last and propagate the format |
| 248 | +# through the graph as soon as the input (or certain weights) is |
| 249 | +# formatted correctly. |
196 | 250 | # |
197 | 251 |
|
198 | 252 | # Need to be done once, after model initialization (or load) |
|
203 | 257 | output = model(input) |
204 | 258 |
|
205 | 259 | ####################################################################### |
206 | | -# However, not all operators fully converted to support Channels Last (usually returning |
207 | | -# contiguous output instead). That means you need to verify the list of used operators |
208 | | -# against supported operators list https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support, |
| 260 | +# However, not all operators are fully converted to support channels |
| 261 | +# last (usually returning contiguous output instead). In the example |
| 262 | +# posted above, layers that do not support channels last will stop the |
| 263 | +# memory format propagation. In spite of that, as we have converted the |
| 264 | +# model to channels last format, each convolution layer, |
| 265 | +# which has its 4-dimensional weight in channels last memory format, |
| 266 | +# will restore channels last memory format and benefit from faster |
| 267 | +# kernels. |
| 268 | +# |
| 269 | +# But operators that do not support channels last do introduce |
| 270 | +# overhead through permutation. Optionally, you can investigate and |
| 271 | +# identify operators in your model that do not support channels last, |
| 272 | +# if you want to improve the performance of the converted model. |
| 273 | +# |
| 274 | +# That means you need to verify the list of used operators |
| 275 | +# against the supported operators list https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support, |
209 | 276 | # or introduce memory format checks into eager execution mode and run your model. |
210 | 277 | # |
211 | 278 | # After running the code below, operators will raise an exception if the output of the |
@@ -284,8 +351,8 @@ def attribute(m): |
284 | 351 |
|
285 | 352 |
|
286 | 353 | ###################################################################### |
287 | | -# If you found an operator that doesn't support Channels Last tensors |
288 | | -# and you want to contribute, feel free to use following developers |
| 354 | +# If you find an operator that doesn't support channels last tensors |
| 355 | +# and you want to contribute, feel free to use the following developer |
289 | 356 | # guide https://github.com/pytorch/pytorch/wiki/Writing-memory-format-aware-operators. |
290 | 357 | # |
291 | 358 |
|
|