Conversation

@keturn
Contributor

@keturn keturn commented Mar 15, 2023

PyTorch 2.0 is released!

It provides a direct interface to several optimized implementations of scaled dot-product attention so we don't need to explicitly depend on xformers or triton anymore.

Fixes #2405

I did some quick and dirty tests here on Linux/CUDA (RTX 3060) and it seems to work in this environment.

To Do

  • test on Windows
  • test on MPS
  • test on ROCm
  • figure out what to do with our _adjust_memory_efficient_attention method. Is it entirely obsolete now that pytorch has a C++ cross-platform implementation of scaled dot product attention to fall back on, or will we still need that?
  • provide some interface to torch.backends.cudnn.deterministic as per notes on avoiding nondeterministic algorithms.
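
For anyone who hasn't tried the new API yet, the whole thing boils down to a single call. A minimal sketch (shapes, device, and dtype chosen arbitrarily for illustration):

import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) - arbitrary sizes for illustration
query = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
key = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
value = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# torch picks the fastest available backend for the device and dtype:
# flash attention, memory-efficient attention, or the C++ math fallback
out = F.scaled_dot_product_attention(query, key, value)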

@psychedelicious
Contributor

So this triples my speeds on a 4090.

  • Update to pytorch 2.0.0 and remove xformers: ~11it/s --> ~22it/s
  • Do not call _adjust_memory_efficient_attention: ~22it/s --> ~33it/s

Unfortunately, even with a 24GB VRAM card, I get OOM errors when decoding large images - like around 1600 x 1600 and up. At some point in the past, I think before diffusers, I could do over 2048 x 2048.

However! With #2920, I can now generate absolutely gargantuan images. There are some artifacts due to the tiling, but a 3072 x 3072 image uses under 8GB of VRAM.

Happy camper here. Thanks!

@blessedcoolant
Collaborator

blessedcoolant commented Mar 18, 2023

Can confirm what @psychedelicious reported. I get nearly 3x speeds on an RTX 3080 laptop GPU on Windows too. But I have to note that this speed boost keeps degrading as I generate more and more images.

@blessedcoolant
Collaborator

blessedcoolant commented Mar 18, 2023

Further testing: I installed the xformers 0.0.17 dev build and left the _adjust_memory_efficient_attention call in place along with Torch 2. The generation speeds are even better: I went from 2it/s to 6it/s for a 512x768 image.

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

Putting some thoughts and testing results here in this PR.

With some brief testing, you get that performance boost but also non-deterministic behavior. None of the available options (subject to change, according to the docs) allow us to reproduce images made with pre-2.0 pytorch; however, it looks like we may be able to get determinism with:

torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

I think this makes it not quite as fast as what you have above. I'd love other people to try that out in place of _adjust_memory_efficient_attention. I'm also curious what diffusers will do w/rt pytorch 2.0.

We also need to look into whether we can disable memory-efficient attention on non-CUDA devices, or whether we have to leave that code alone.
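
For scoped experiments, torch 2.0 also exposes those same three switches as a context manager, which would let us restrict the change to just the attention calls rather than flipping global state. A quick sketch:

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# same toggles as the calls above, but restored automatically on exit
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=True, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)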

@keturn
Contributor Author

keturn commented Mar 18, 2023

Do we need to micromanage each possible implementation like that, or is it sufficient to use torch.backends.cudnn.deterministic = True? https://pytorch.org/docs/master/notes/randomness.html#cuda-convolution-determinism

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

> Do we need to micromanage each possible implementation like that, or is it sufficient to use torch.backends.cudnn.deterministic = True? https://pytorch.org/docs/master/notes/randomness.html#cuda-convolution-determinism

Maybe we can get away with that or torch.use_deterministic_algorithms(True). I don't know what that does from a performance perspective but I can toy around with it. And if it is deterministic after setting either of those, then we have to see what effect that has on memory and attention slicing. I'll investigate when I can.

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

Here's what I get:

RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

After setting determinism on with torch.use_deterministic_algorithms(True) and doing as the error message suggests, images generated in succession are identical to each other, and generation times look to be maybe a bit faster. The far faster times reported above come at the expense of reproducibility.

It looks like we can potentially yank out the attention slicing code, but that causes images to differ between torch 2.0 and pre-2.0.
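
Roughly, the recipe that gave me identical successive images looks like this (a sketch, not the exact code I ran; the env var has to be set before torch initializes cuBLAS):

import os

# the error message above offers ":4096:8" or ":16:8"; the smaller
# workspace saves memory but may cost some performance
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.use_deterministic_algorithms(True)
torch.manual_seed(42)  # with a fixed seed, repeated generations now match exactly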

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

On the plus side, even with those deterministic algorithms in use, I can now generate really large images until I get to the decoding step.

@keturn
Contributor Author

keturn commented Mar 18, 2023

Looks like the current anchor for that section of the cuBLAS docs is a bit different: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility

Having to set the environment variable is a bit awkward. Can you tell if it's something that needs to be set before the library is initialized, or can we (re)set it on the fly at runtime? Wouldn't be so bad if we could do it that way.

@JPPhoto
Contributor

JPPhoto commented Mar 18, 2023

Maybe somewhere in CLI.py (or another place/places), we do: os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"? As long as that hits before torch and friends load up, I imagine that will work.

@psychedelicious
Contributor

If we cannot change these settings on the fly, can we explore wrapping the initialization of torch so we can re-do that if these settings are changed, without having to fully quit the application?

@lstein
Collaborator

lstein commented Mar 23, 2023

> Can confirm what @psychedelicious reported. I get nearly 3x speeds on an RTX 3080 laptop GPU on Windows too. But I have to note that this speed boost keeps degrading as I generate more and more images.

Why would the speed increase decline? Is there a memory leak?

@lstein
Collaborator

lstein commented Mar 23, 2023

Tested on a ROCm system:

  • Good news: renders a nearly identical "banana sushi" to 1.13. Differences are subtle and about the same as generation-to-generation variances with xformers on a CUDA system. No variation from one image to the next when using 2.0.0 repeatedly.
  • Disappointing news: no improvement in rendering speed.
  • Expected news: on the AMD GPU that I use, there is a 60s "warmup period" before rendering starts the very first time torch is called. After this, there is no delay, even when invokeai is killed and restarted. This is the same behavior I observed previously in 1.13, and it was fixed by recompiling pytorch from source.

@lstein
Collaborator

lstein commented Mar 23, 2023

I tested in a CUDA system (NVIDIA RTX A2000 12GB) just now and the performance of 1.13+xformers is equal to 2.0.0 without xformers. No 3x speedup in my hands, unfortunately!

@psychedelicious
Contributor

@lstein did you comment out the call to _adjust_memory_efficient_attention? Doing that was half the 200% improvement

@lstein
Collaborator

lstein commented Mar 23, 2023

> Maybe somewhere in CLI.py (or another place/places), we do: os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"? As long as that hits before torch and friends load up, I imagine that will work.

It is a bit tricky to use environment variables to configure python libraries since the environment variable needs to be set before the first import torch statement. In the current code, we are already doing this at the top of CLI.py (this was edited a bit for clarity):

import os
import re
import shlex
import sys
import traceback
from argparse import Namespace
from pathlib import Path
from typing import Union
# [more non-torch imports]

if sys.platform == "darwin":
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

from ...backend import Generate, ModelManager
# [more backend imports]

If we need to set a bunch of environment variables, then I would suggest that we make a new .py file with all the environment-setting code in it. Alternatively, we could change the code execution order so that the command-line arguments are parsed early on; this would enable us to make the environment variable settings a feature of the invokeai.init file.

Something similar has to be done for api_app.py and cli_app.py for nodes.
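
A sketch of what that environment-setting module might look like (the file name env_setup.py is just a placeholder):

# env_setup.py - import this before anything that imports torch
import os
import sys

if sys.platform == "darwin":
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# required by torch.use_deterministic_algorithms(True) with CUDA >= 10.2
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

CLI.py, api_app.py, and cli_app.py would then each import this module ahead of their first torch (or backend) import.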

@lstein
Collaborator

lstein commented Mar 23, 2023

> @lstein did you comment out the call to _adjust_memory_efficient_attention? Doing that was half the 200% improvement

It's already commented out in the PR.

@lstein
Collaborator

lstein commented Mar 23, 2023

> @lstein did you comment out the call to _adjust_memory_efficient_attention? Doing that was half the 200% improvement
>
> It's already commented out in the PR.

I also tried running with 2.0.0 and xformers 0.0.17rc482, with and without _adjust_memory_efficient_attention and am not seeing any effect on rendering speeds. This is all with 512x512 images and stable-diffusion-1.5. Are the improvements more dramatic with larger images?

@psychedelicious
Contributor

> I also tried running with 2.0.0 and xformers 0.0.17rc482, with and without _adjust_memory_efficient_attention and am not seeing any effect on rendering speeds. This is all with 512x512 images and stable-diffusion-1.5. Are the improvements more dramatic with larger images?

For me, the improvements were basically the same across all resolutions (about 3x faster). I did not notice any degradation in performance over time, but that probably needs more careful testing - just going from memory here.

I think you've tried all of the permutations then - maybe the improvements are related to the new cu118? Could be it has improvements only for certain platforms (the A2000 and the 3000 series are Ampere, while the 4000 series is Ada Lovelace).

@blessedcoolant
Collaborator

I'll run through some tests and see if the degradation persists. But in either case, I think upgrading to PyTorch 2 is a no-brainer once we have all the roadblocks resolved.

@lstein
Collaborator

lstein commented Apr 2, 2023

There's been no activity on the PR for several days. Seems to me we should just go ahead with this?

@JPPhoto
Contributor

JPPhoto commented Apr 6, 2023

> There's been no activity on the PR for several days. Seems to me we should just go ahead with this?

I thought you disliked non-deterministic behavior? Per the comments above, I think we also need to resolve whether to keep the slicing code for the case where the user wants determinism - something we can do, but which we still need to address. IMO this is not ready for prime time, and we should pin torch~=1.13.1 until we figure it out.

Collaborator

@lstein lstein left a comment

Works in my hands.

@lstein
Collaborator

lstein commented Apr 7, 2023

Is anyone still working on this? Otherwise it's going to get left behind.

@blessedcoolant
Collaborator

blessedcoolant commented Apr 7, 2023

I've been using 2.0.0 since it was released, but I am also using xformers together with it because I get much faster results. There are obviously determinism issues, but those exist with xformers too.

So maybe we keep the attention handling as it is for now, upgrade to 2.0, and pin xformers to the latest version so it is compatible with the new torch.

@psychedelicious
Contributor

Can we add a new flag to disable the slicing? I'd rather get the massive speed boost than have deterministic results most of the time.
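
Something like this, perhaps (flag name and wiring purely hypothetical):

import argparse

parser = argparse.ArgumentParser()
# hypothetical flag: trade reproducibility for the fast SDPA path
parser.add_argument(
    "--no-attention-slicing",
    action="store_true",
    help="skip the attention-slicing adjustments and let torch pick the fastest backend",
)
args = parser.parse_args()

if not args.no_attention_slicing:
    pass  # only call _adjust_memory_efficient_attention here when slicing is wanted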

@JPPhoto
Contributor

JPPhoto commented Apr 7, 2023

@psychedelicious @lstein We need to do more than that just to get deterministic behavior - and we should have the option to do so. See above. All of this makes me worry that we'll lose reproducibility; isn't that important for the audit trail we want to have for results?

@hipsterusername
Member

@lstein - I believe we've addressed this concern w/ latest xformers update. Good to close?
