Skip to content

Reduce overhead of bindings requiring cuPythonInit() #894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mdboom
Copy link
Contributor

@mdboom mdboom commented Aug 22, 2025

Description

This reduces the calling overhead of binding functions that require cuPythonInit or cudaPythonInit to be called.

This reduces the time it takes to call driver.cuDeviceGet(0) (for example) by about 50ns (on my machine):

  • Base: Mean +- std dev: 149 ns +- 8 ns
  • This PR: Mean +- std dev: 101 ns +- 2 ns

Note that these times include the work done by the actual underlying cuDeviceGet call, not just the function call overhead.

Why this works

The cuPythonInit function is extremely large, so no C compiler is likely to ever inline it. By creating a small wrapper function just to check the init flag and then delegate to the big function, the C compiler inlines it. This not only removes a C function call, but probably helps out the branch predictor when checking the flag.

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link
Contributor

copy-pr-bot bot commented Aug 22, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mdboom mdboom requested review from Copilot and leofang August 22, 2025 16:36
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the initialization check for CUDA Python bindings by reducing function call overhead. The optimization splits the existing cuPythonInit() and cudaPythonInit() functions into two parts: a small wrapper function that checks if initialization has already occurred, and a larger function that performs the actual initialization work.

  • Refactors initialization functions to use a small wrapper pattern for better compiler inlining
  • Moves the initialization flag check to a separate small function to enable C compiler optimization
  • Applies the same pattern across three binding files for consistency

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
cyruntime.pyx.in Splits cudaPythonInit() into wrapper and implementation functions
cynvrtc.pyx.in Splits cuPythonInit() into wrapper and implementation functions
cydriver.pyx.in Splits cuPythonInit() into wrapper and implementation functions
Comments suppressed due to low confidence (1)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@mdboom
Copy link
Contributor Author

mdboom commented Aug 22, 2025

/ok to test

Copy link

@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do cuda.bindings Everything related to the cuda.bindings module to-be-backported Trigger the bot to raise a backport PR upon merge labels Aug 22, 2025
@leofang leofang added this to the cuda-python parking lot milestone Aug 22, 2025
Copy link
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a release note to 13.X.Y. (I want to backport it to 12.9.X, so perhaps also a good idea to touch its release note. Note that we only generate docs on the main branch #809.)

Also we need to backport this to the codegen 🙂

@mdboom
Copy link
Contributor Author

mdboom commented Aug 22, 2025

Let's also add a release note to 13.X.Y. (I want to backport it to 12.9.X, so perhaps also a good idea to touch its release note. Note that we only generate docs on the main branch #809.)

👍

Also we need to backport this to the codegen 🙂

Sure, will do. The upstream of the generator at [email protected]:NVIDIA/cuda-python-private.git doesn't seem to have the free-threading fix yet (so running the generator for this change reverts all that). Is that just because that fix hasn't been implemented in the generator yet, or am I using the wrong branch or something on my end?

@leofang
Copy link
Member

leofang commented Aug 22, 2025

I guess you're blocked by me now... let me get to it asap

@github-project-automation github-project-automation bot moved this from Todo to In Review in CCCL Aug 22, 2025
@leofang
Copy link
Member

leofang commented Aug 22, 2025

/ok to test 4c6a057

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda.bindings Everything related to the cuda.bindings module enhancement Any code-related improvements P1 Medium priority - Should do to-be-backported Trigger the bot to raise a backport PR upon merge
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

2 participants