deps: upgrade to PyTorch 2.0 (replaces xformers) #2962
Conversation
So this triples my speed on a 4090.
Unfortunately, even with a 24GB VRAM card, I get OOM errors when decoding large images - around 1600 x 1600 and up. However, with #2920 I can now generate absolutely gargantuan images. There are some artifacts due to the tiling, but a 3072 x 3072 image uses under 8GB VRAM. Happy camper here. Thanks!
Can confirm what @psychedelicious reported. I get nearly 3x speeds on an RTX 3080 laptop GPU on Windows too. But I have to note that this speed boost keeps diminishing as I generate more and more images.
Further testing. I installed
Putting some thoughts and testing results here in this PR. With some brief testing, you get the performance boost but also non-deterministic behavior. None of the available options (subject to change, according to the docs) allow us to reproduce images made with pre-2.0 PyTorch; however, it looks like we may be able to get determinism with the usual settings (see the sketch below). I think this makes it not quite as fast as what you have above. I'd love other people to try that out in place of the current setup. We also need to look into whether we can disable memory-efficient attention for non-CUDA, or whether we have to leave that code alone.
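For concreteness, here is a minimal sketch of the kind of determinism configuration being discussed. These are just the standard PyTorch knobs; the exact combination that reproduces pre-2.0 images is still an open question.

```python
import torch

# Ask PyTorch to use deterministic implementations where they exist and to
# raise an error when an op has none (pass warn_only=True to only warn).
torch.use_deterministic_algorithms(True)

# Make cuDNN choose deterministic convolution algorithms and disable the
# benchmarking autotuner, which can otherwise pick different kernels per run.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```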
Do we need to micromanage each possible implementation like that, or is it sufficient to use
Maybe we can get away with that or |
Here's what I get:
After setting determinism on as above, it looks like we can potentially yank out the attention slicing code, but that causes images to differ between torch 2.0 and pre-torch 2.0.
On the plus side, even with those deterministic algorithms in use, I can now generate really large images until I get to the decoding step.
Looks like the current anchor for that section of the cuBLAS docs is a bit different: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
Having to set the environment variable is a bit awkward. Can you tell whether it needs to be set before the library is initialized, or whether we can (re)set it on the fly at runtime? It wouldn't be so bad if we could do it that way.
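As far as I can tell, the safest assumption is that `CUBLAS_WORKSPACE_CONFIG` has to be in the environment before the first CUDA/cuBLAS initialization, which in practice means before `torch` is imported in the entry point. A sketch:

```python
import os

# Must be set before torch (and therefore cuBLAS) is loaded; ":4096:8" and
# ":16:8" are the two values the cuBLAS reproducibility docs describe.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch  # noqa: E402  (deliberately imported after the env var is set)
```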
Maybe somewhere in
If we cannot change these settings on the fly, can we explore wrapping the initialization of torch so that we can redo it if these settings change, without having to fully quit the application?
Why would the speed increase decline? Is there a memory leak? |
Tested on a ROCm system. Good news: it renders a "banana sushi" nearly identical to 1.13. Differences are subtle and about the same as generation-to-generation variances with
I tested on a CUDA system (NVIDIA RTX A2000 12GB) just now, and the performance of 1.13 + xformers is equal to 2.0.0 without xformers. No 3x speedup in my hands, unfortunately!
@lstein did you comment out the call to |
It is a bit tricky to use environment variables to configure Python libraries, since the environment variable needs to be set before the first import of the library.
If we need to set a bunch of environment variables, then I would suggest that we make a new .py file with all the environment-setting code in it (see the sketch below). Alternatively, we could change the code execution so that the command-line arguments are parsed early on; this would enable us to make the environment variable settings a feature of the
Something similar has to be done for
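A sketch of the dedicated-module idea; the module and variable names here are hypothetical, just to illustrate the pattern:

```python
# invokeai_env.py (hypothetical name): a central place for environment
# settings that must exist before torch is imported anywhere in the app.
import os

_DEFAULTS = {
    # Required for deterministic cuBLAS behaviour; see the cuBLAS docs.
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
}


def apply() -> None:
    """Set defaults without clobbering values the user already exported."""
    for name, value in _DEFAULTS.items():
        os.environ.setdefault(name, value)
```

The entry-point script would then call `invokeai_env.apply()` as one of its very first statements, before anything that imports torch is touched.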
It's already commented out in the PR. |
I also tried running with 2.0.0 and |
For me, the improvements were basically the same across all resolutions (about 3x faster). I did not notice any degradation in performance over time, but that probably needs more careful testing; I'm just going from memory here. I think you've tried all of the permutations then. Maybe the improvements are related to the new cu118 builds? Could be they only help on certain platforms (the A2000 and the 3000 series are Ampere, while the 4000 series is Ada Lovelace).
I'll run through some tests and see if the degradation persists. But in either case, I think upgrading to PyTorch 2 is a no-brainer once we have all the roadblocks resolved.
There's been no activity on the PR for several days. Seems to me we should just go ahead with this?
I thought you disliked non-deterministic behavior? I think we also need to resolve whether we should keep the slicing code, as per the comments above, for the case where the user wants determinism - which we can do, but we need to address it. IMO this is not ready for prime time, and we should lock things in at
lstein left a comment:
Works in my hands.
Is anyone still working on this? Otherwise it's going to get left behind.
I've been using 2.0.0 since it was released, but I am also using xformers with it because I get much faster results. There are obviously determinism issues, but those exist with xformers too. So maybe we keep that attention on for now, upgrade to 2.0, and also pin xformers to the latest version so that it is compatible with the new torch.
Can we add a new flag to disable the slicing? I'd rather get the massive speed boost than have deterministic results most of the time.
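A hedged sketch of what such a flag could look like, assuming a diffusers-style pipeline; the flag name and the wiring are hypothetical, not something in this PR:

```python
import argparse

from diffusers import StableDiffusionPipeline

parser = argparse.ArgumentParser()
# Hypothetical flag: trade determinism for speed by leaving attention
# slicing off and letting torch pick the fastest SDP kernel.
parser.add_argument("--no_attention_slicing", action="store_true")
args = parser.parse_args()

# Example model id; any StableDiffusionPipeline would do here.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
if args.no_attention_slicing:
    pipe.disable_attention_slicing()
else:
    pipe.enable_attention_slicing()
```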
@psychedelicious @lstein We need to do more than that to get deterministic behavior - and we should have the option to do so; see above. All of this makes me uncomfortable that we'll lose reproducibility; isn't that important for the audit trail we want to have for results?
@lstein - I believe we've addressed this concern with the latest xformers update. Good to close?
PyTorch 2.0 is released!
It provides a direct interface to several optimized implementations of scaled dot-product attention so we don't need to explicitly depend on xformers or triton anymore.
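A minimal sketch of the new interface (the shapes are illustrative; `torch.backends.cuda.sdp_kernel` is the PyTorch 2.0 context manager for constraining which backend is used):

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, sequence length, head dim).
q, k, v = (torch.randn(1, 8, 64, 40) for _ in range(3))

# PyTorch 2.0 picks the best available backend (flash, memory-efficient,
# or the C++ math fallback) automatically.
out = F.scaled_dot_product_attention(q, k, v)

# The backend can also be constrained explicitly, e.g. to force the
# memory-efficient kernel on CUDA:
if torch.cuda.is_available():
    q, k, v = (t.cuda().half() for t in (q, k, v))
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=False, enable_mem_efficient=True
    ):
        out = F.scaled_dot_product_attention(q, k, v)
```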
Fixes #2405
I did some quick and dirty tests here on Linux/CUDA (RTX 3060) and it seems to work in this environment.
To Do
- `_adjust_memory_efficient_attention` method: is it entirely obsolete now that PyTorch has a C++ cross-platform implementation of scaled dot-product attention to fall back on, or will we still need it?
- `torch.backends.cudnn.deterministic`, as per the notes on avoiding nondeterministic algorithms.