Reduce overhead of bindings requiring cuPythonInit()
#894
Pull Request Overview
This PR optimizes the initialization check for CUDA Python bindings by reducing function call overhead. The optimization splits the existing `cuPythonInit()` and `cudaPythonInit()` functions into two parts: a small wrapper function that checks if initialization has already occurred, and a larger function that performs the actual initialization work.
- Refactors initialization functions to use a small wrapper pattern for better compiler inlining
- Moves the initialization flag check to a separate small function to enable C compiler optimization
- Applies the same pattern across three binding files for consistency
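The wrapper pattern described above can be sketched in plain C. This is a minimal illustration, not the generated binding code: every name here (`g_initialized`, `g_slow_calls`, `init_slow`, `ensure_init`, `get_device`) is hypothetical, and the `noinline` attribute only stands in for a function that is too large for the compiler to inline on its own.

```c
/* Hypothetical sketch of the "small wrapper + large slow path" pattern.
 * None of these names come from the cuda-python sources. */

static int g_initialized = 0; /* set once the slow path has completed */
static int g_slow_calls = 0;  /* counts slow-path entries, for illustration only */

/* Stand-in for the large initialization routine. The attribute mimics a
 * body that the compiler would decline to inline because of its size. */
#if defined(__GNUC__)
__attribute__((noinline))
#endif
static int init_slow(void)
{
    g_slow_calls++;
    /* ... load the library, resolve many function pointers ... */
    g_initialized = 1;
    return 0;
}

/* Small wrapper: only the flag check lives here, so inlining it into
 * every caller is cheap. */
static inline int ensure_init(void)
{
    if (g_initialized) {
        return 0; /* fast path: already initialized, no call made */
    }
    return init_slow(); /* cold path: taken on the first call only */
}

/* Stand-in for a binding function that needs initialization first. */
static int get_device(int ordinal, int *device)
{
    int err = ensure_init();
    if (err != 0) {
        return err;
    }
    *device = ordinal;
    return 0;
}
```

With this split, repeated calls to `get_device` pay only for a flag test after the first call, while the expensive routine stays out of line and is entered once.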
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
cyruntime.pyx.in | Splits cudaPythonInit() into wrapper and implementation functions |
cynvrtc.pyx.in | Splits cuPythonInit() into wrapper and implementation functions |
cydriver.pyx.in | Splits cuPythonInit() into wrapper and implementation functions |
/ok to test |
Let's also add a release note to 13.X.Y. (I want to backport it to 12.9.X, so perhaps also a good idea to touch its release note. Note that we only generate docs on the main branch #809.)
Also we need to backport this to the codegen 🙂
👍
Sure, will do. The upstream of the generator at |
I guess you're blocked by me now... let me get to it asap
/ok to test 4c6a057
Description
This reduces the calling overhead of binding functions that require `cuPythonInit` or `cudaPythonInit` to be called.

This reduces the time it takes to call `driver.cuDeviceGet(0)` (for example) by about 50 ns (on my machine). Note that the measured times include the work done by the actual underlying `cuDeviceGet` call, not just the function call overhead.

Why this works
The `cuPythonInit` function is extremely large, so no C compiler is likely to ever inline it. By creating a small wrapper function that only checks the `init` flag and then delegates to the big function, we give the C compiler something it does inline. This not only removes a C function call, but probably also helps the branch predictor when checking the flag.

closes
Checklist