pmeier (Contributor) commented Oct 24, 2022

The v1 kernel includes a lot of checks that aren't useful:

_assert_image_tensor(img)
if img.ndim < 3:
    raise TypeError(f"Input image tensor should have at least 3 dimensions, but found {img.ndim}")
if img.dtype != torch.uint8:
    raise TypeError(f"Only torch.uint8 image tensors are supported, but found {img.dtype}")
_assert_channels(img, [1, 3])

  • we are already inside the tensor kernel, so there is no need to assert that again
  • the kernel works with an arbitrary number of channels, so there is no need to enforce {1, 3}
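With those checks dropped, the kernel body reduces to a single masking op. A minimal sketch of the remaining logic (not the verbatim torchvision kernel), assuming uint8 input:

```python
import torch


def posterize(image: torch.Tensor, bits: int) -> torch.Tensor:
    # Keep only the top `bits` bits of each uint8 value. The negative
    # scalar is a spelling of the mask ~(2 ** (8 - bits) - 1) that avoids
    # `~`; under uint8 arithmetic it wraps to e.g. 0b11110000 for bits=4.
    mask = -int(2 ** (8 - bits))
    return image & mask


image = torch.tensor([0, 17, 128, 255], dtype=torch.uint8)
assert posterize(image, 4).tolist() == [0, 16, 128, 240]
```

Since the mask is applied elementwise, any shape and any channel count works, which is why enforcing {1, 3} channels buys nothing.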

Other than that, there is unfortunately no further optimization possible. Similar to #6819 (comment), it seems that using scalars with out-of-place ops is faster than manually putting them into a tensor and using in-place ops:

import itertools

import torch
from torch.utils import benchmark


shapes = [
    (3, 256, 256),  # single image
    (5, 3, 256, 256),  # single video
]
devices = ["cpu", "cuda"]


def scalar_and(image, bits):
    mask = -int(2 ** (8 - bits))
    return mask & image


def full_like(image, bits):
    mask = torch.full_like(image, -int(2 ** (8 - bits)))
    return mask.bitwise_and_(image)


timers = [
    benchmark.Timer(
        stmt="fn(input, bits)",
        globals=dict(
            fn=fn,
            input=torch.testing.make_tensor(shape, dtype=torch.uint8, device=device, low=0, high=256),
            bits=4,
        ),
        label="posterize",
        sub_label=f"{shape!s:16} / {device:4}",
        description=fn.__name__,
    )
    for fn, shape, device in itertools.product([scalar_and, full_like], shapes, devices)
]

measurements = [timer.blocked_autorange(min_run_time=5) for timer in timers]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
[---------------------- posterize -----------------------]
                               |  scalar_and  |  full_like
1 threads: -----------------------------------------------
      (3, 256, 256)    / cpu   |      28      |      95   
      (3, 256, 256)    / cuda  |       5      |       7   
      (5, 3, 256, 256) / cpu   |     119      |     466   
      (5, 3, 256, 256) / cuda  |       5      |       7   

Times are in microseconds (us).
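As an aside on the mask itself: the negative scalar in scalar_and is just the "keep the top bits" pattern written without `~` (presumably to stay TorchScript-friendly). A quick plain-Python sanity check of that equivalence:

```python
bits = 4
mask = -int(2 ** (8 - bits))  # -16

# Truncated to 8 bits (two's complement), -16 is exactly 0b11110000,
# i.e. the pattern that keeps the top `bits` bits of a uint8 value:
assert format(mask & 0xFF, "08b") == "11110000"

# ... and it equals the more conventional spelling of the same mask:
assert (mask & 0xFF) == (~(2 ** (8 - bits) - 1) & 0xFF)
```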

Nevertheless, I've run the benchmark from #6818 to see whether the extra checks have an effect on performance:

[------- posterize @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    37  |    35
      (1, 512, 512)       / uint8   / cuda  |     8  |     5
      (3, 512, 512)       / uint8   / cpu   |    97  |    93
      (3, 512, 512)       / uint8   / cuda  |     9  |     5
      (5, 3, 512, 512)    / uint8   / cpu   |   446  |   464
      (5, 3, 512, 512)    / uint8   / cuda  |    34  |    35
      (4, 5, 3, 512, 512) / uint8   / cpu   |  1900  |  1800
      (4, 5, 3, 512, 512) / uint8   / cuda  |   130  |   130
2 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    27  |    23
      (3, 512, 512)       / uint8   / cpu   |    59  |    54
      (5, 3, 512, 512)    / uint8   / cpu   |   243  |   240
      (4, 5, 3, 512, 512) / uint8   / cpu   |   940  |   938
4 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    18  |    15
      (3, 512, 512)       / uint8   / cpu   |    36  |    32
      (5, 3, 512, 512)    / uint8   / cpu   |   130  |   127
      (4, 5, 3, 512, 512) / uint8   / cpu   |   480  |   480

Times are in microseconds (us).

As expected, there is no performance difference. Any measured difference is just noise.

datumbox (Contributor) previously approved these changes Oct 24, 2022

LGTM, thanks!

datumbox (Contributor) commented:
@pmeier Given there is no performance boost, we can consider deferring merging until later to reduce copy-pasted code. This is one potential way to handle kernels that can't be optimized further (continue aliasing). No strong opinion; I'll leave it to you whether you want to merge or not.

@datumbox datumbox dismissed their stale review October 24, 2022 18:05

Doesn't improve performance, we should discuss aliasing.

pmeier (Contributor, Author) commented Oct 27, 2022

Superseded by #6847.

@pmeier pmeier closed this Oct 27, 2022