Register ops to AutocastCPU #4412
Conversation
(force-pushed from a810d89 to 72fd957)
NicolasHug left a comment
Thanks @CaoE, I just took a brief look. For now I'm just curious, what is your use case for supporting autocast on CPU?
test/test_ops.py (outdated)

    @pytest.mark.parametrize('x_dtype', (torch.float, torch.half))
    @pytest.mark.parametrize('rois_dtype', (torch.float, torch.half))
    def test_autocast_cpu(self, x_dtype, rois_dtype):
Instead of creating a new test, maybe we could just parametrize over the device with the cpu_and_gpu() function? Since the context manager is device-dependent, we could just set it in the code, like

    cm = torch.cpu.amp.autocast if device == 'cpu' else torch.cuda.amp.autocast
    with cm():
        self.test_forward(torch.device(device), contiguous=False, x_dtype=x_dtype, rois_dtype=rois_dtype)

We could do the same for the rest of the newly introduced tests.
Yeah, good idea, I will modify it like this.
Because we want to run deep learning models on CPU servers, but torchvision ops like nms raise errors when given BFloat16 inputs.
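A minimal illustration of that failure mode (a sketch; the exact error depends on the torchvision build):

    import torch
    from torchvision.ops import nms

    # Valid (x1, y1, x2, y2) boxes and scores in bfloat16, as a model
    # running under CPU autocast would produce
    boxes = torch.tensor([[0., 0., 10., 10.],
                          [1., 1., 11., 11.]], dtype=torch.bfloat16)
    scores = torch.tensor([0.9, 0.8], dtype=torch.bfloat16)

    with torch.cpu.amp.autocast():
        # Without an AutocastCPU registration, nms receives the raw
        # bfloat16 tensors and errors out; with this PR's registration
        # the inputs are cast to float first
        keep = nms(boxes, scores, iou_threshold=0.5)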
(force-pushed from ee6fa35 to 8367b5a)
NicolasHug left a comment
test/test_ops.py (outdated)

    self.test_nms_cuda(iou=iou, dtype=dtype)

    @pytest.mark.parametrize("dtype", (torch.bfloat16, torch.half))
    def test_autocast(self, device, iou, dtype):

    def test_nms_cpu(iou, dtype):
Instead of creating this new one here, do you think we could just rely on test_nms_ref instead?
It doesn't accept a dtype parameter, so we could add one if it's relevant; alternatively, we can just define test_fn as a partial function.
No strong opinion on this, we can also leave it as is, even though it's a bit unfortunate that we have to define a new test_nms_cpu here.
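A sketch of the partial-function variant (hypothetical wiring; it assumes test_nms_ref in test/test_ops.py accepts iou and dtype parameters):

    from functools import partial

    import torch

    # Hypothetical: rather than defining a separate test_nms_cpu, fix the
    # dtype on the existing reference test and reuse it under autocast
    test_fn = partial(test_nms_ref, dtype=torch.bfloat16)
    test_fn(iou=0.5)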
I created this new one here just to check whether the input is converted to float. Is your suggestion to use test_nms_ref instead of test_nms_cpu here? Either is fine with me. Thank you so much for your suggestion @NicolasHug
test_nms_ref instead of test_nms_cpu here?
Yes :)
Ok, I will modify it later.
datumbox left a comment
The C++ changes overall look good to me. I would advise testing them on FBcode and updating buck before merging, to ensure we won't break anything internally.
My only concern, and it's not linked to this PR, is the amount of duplicate and boilerplate code we are forced to add to the repo to handle the registrations.
@ezyang I know that about a year ago you were working on improving op registration. Is there a better way to do this?
setup.py (outdated)

    source_cuda = glob.glob(os.path.join(extensions_dir, 'ops', 'cuda', '*.cu'))
    source_cuda += glob.glob(os.path.join(extensions_dir, 'ops', 'autocast', '*.cpp'))
    source_cuda += glob.glob(os.path.join(extensions_dir, 'ops', 'autocast', 'cuda', '*.cpp'))
Buck will need changes due to this. Might be worth bringing the PR into FBcode prior to merging to ensure it does not break anything.
If I understand it correctly, the only change between the CPU and CUDA folders is:
Is there any other difference, or do we expect them to diverge more over time?
test/test_ops.py (outdated)

    with torch.cuda.amp.autocast():
        self.test_forward(torch.device("cuda"), contiguous=False, x_dtype=x_dtype, rois_dtype=rois_dtype)

    def test_autocast(self, device, x_dtype, rois_dtype):
        cm = torch.cpu.amp.autocast if device == 'cpu' else torch.cuda.amp.autocast
Someone should file a bug asking for a torch.amp.autocast that accepts a device argument lol
I think the most direct way to reduce boilerplate is to reuse the templates/macros that are used inside PyTorch core to set up autocasting. I haven't checked whether you are doing unusual autocasting that would make these inapplicable. We don't have codegen support for autocasting, so that's right out, and no one has written a boxed fallback for autocasting (maybe someone should!)
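As an aside, later PyTorch releases did add a device-generic entry point, torch.autocast(device_type=...), which would collapse the cm dispatch above to a single line (a sketch; on CPU the supported low-precision dtype is bfloat16):

    import torch

    # One context manager for both devices; device_type selects the backend
    with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
        out = torch.mm(torch.rand(4, 4), torch.rand(4, 4))
        # mm is on the CPU autocast lower-precision list, so out.dtype
        # is torch.bfloat16 here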
(force-pushed from 53c7e00 to 83293bf)
@CaoE Are we good to merge?
Yes, I have nothing else to commit.
Awesome, let us run some additional tests on the internal FB infra and we will merge right after.
datumbox left a comment
Marking this as "changes needed" to avoid accidental merges prior to testing it on FBcode.
@CaoE No further action is needed on your side, you are good to go. Just give us a bit more time to test things :)
fmassa left a comment
Marking as "request changes" so that we don't forget to answer the questions in my earlier comment.
While we can have duplicate code here initially, it would be good for us to understand whether we plan to keep those two sets of files practically identical, or whether further changes are expected that could make the CPU / CUDA files diverge.
This is important because we can plan follow-up work to refactor this redundancy out, reducing maintenance cost in the future, but only if we know the changes will be limited to the ones I pointed out in my previous comment.
Hi @datumbox @NicolasHug @ezyang @fmassa, may I know when this PR can be merged, and whether it can make it into the torchvision branch for the PyTorch 1.10 release? Thank you very much.
@CaoE Note that the PR has conflicts with main that need to be resolved. Some of them can be addressed as described in #4539, but I think it might be easier to fix them manually. Also, could you please provide clarifications on #4412 (comment)?
@CaoE We were waiting for your clarifications on my questions before moving forward with merging this PR. Also, the branch cut / freeze was on Friday, so it might be hard to get these changes into the 0.11 release.
@fmassa @datumbox Sorry for missing the questions, and thank you for the detailed explanation.
Yes.
* modify the directory structure: moved the autocast files from torchvision/csrc/ops/autocast/ to torchvision/csrc/ops/autocast/cuda;
* add the cpu directory under the autocast directory;
* register deform_conv2d, nms, ps_roi_align, ps_roi_pool, roi_align, and roi_pool to AutocastCPU.
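A quick way to exercise the new registrations (a sketch; it assumes the dtype-widening behavior described in this PR):

    import torch
    from torchvision.ops import roi_align

    x = torch.rand(1, 3, 16, 16, dtype=torch.bfloat16)
    # rois rows are (batch_index, x1, y1, x2, y2)
    rois = torch.tensor([[0., 0., 0., 8., 8.]], dtype=torch.bfloat16)

    with torch.cpu.amp.autocast():
        # With roi_align registered to AutocastCPU, the bfloat16 inputs
        # are accepted instead of raising a dtype error
        out = roi_align(x, rois, output_size=(4, 4))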