[CUDA] P2P buffer/image memory copy #4401
Changes from 12 commits
@@ -137,4 +137,9 @@ _PI_API(piextPluginGetOpaqueData)

 _PI_API(piTearDown)

+_PI_API(piextEnqueueMemBufferCopyPeer)
+_PI_API(piextEnqueueMemBufferCopyRectPeer)
+_PI_API(piextEnqueueMemImageCopyPeer)
+_PI_API(piextDevicesSupportP2P)
Reviewer: Can we re-use the existing piDeviceGetInfo with a new PI_DEVICE_INFO_P2P_DEVICES, which would return all devices that have a P2P connection to this device?

Author: I don't think that there is any CUDA/HIP driver API that takes a device as an argument and returns information on P2P connection status with all other devices. See https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html for the CUDA APIs. I think that any plugin API attempting to return all devices that have a P2P connection to a given device would require a list of all other devices (or equivalent) as an argument.

Reviewer: I don't think so either, but you should be able to get back to the platform from your PI device and get a list of all devices; then you can check for each device whether access is allowed, or, in the case of the CUDA backend, just return the same list without the corresponding device in it. As an additional note, maybe in that case it would make sense to make it a bit more granular, e.g. having two new queries.

Author: That sounds fine in the case that P2P copies are forbidden between devices not sharing the same platform. If cross-context P2P copies are allowed, wouldn't we need to access a list of all platforms?

Reviewer: Good question. I'm not convinced that cross-platform P2P should be allowed, but I suppose a backend can make that decision. Currently the CUDA backend only has one platform, so it should be sufficient to just use that one. If at some point the CUDA backend introduces more platforms and P2P should be allowed between them, calling

Author: OK, I see. Sure, I can do it like that.

Author: I've addressed this now. If the change is acceptable, and provided we keep the cross-context memory copy runtime impl, then I will implement piDeviceGetInfo for the other PIs that don't have it, since all PIs will call it now in graph_builder.
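The reviewer's suggestion above (answer a PI_DEVICE_INFO_P2P_DEVICES query by walking the platform's device list and filtering on per-pair peer access) can be sketched as follows. This is a minimal stand-in, not the actual plugin code: the `Device` alias and `peerDevices` helper are hypothetical, and in the real CUDA backend the predicate would wrap something like cuDeviceCanAccessPeer.

```cpp
#include <functional>
#include <vector>

// Hypothetical device handle; in the real plugin this would be pi_device.
using Device = int;

// Sketch of answering a P2P-devices info query: keep every device on the
// platform that the predicate says is reachable, excluding the queried
// device itself (the CUDA backend could simply return all other devices).
std::vector<Device> peerDevices(
    const std::vector<Device> &platformDevices, Device self,
    const std::function<bool(Device, Device)> &canAccessPeer) {
  std::vector<Device> peers;
  for (Device d : platformDevices)
    if (d != self && canAccessPeer(self, d))
      peers.push_back(d);
  return peers;
}
```

With an always-true predicate this reduces to "the same list without the corresponding device in it", matching the CUDA-backend shortcut described above.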
#undef _PI_API
Reviewer: Why do we need these new APIs? Why wouldn't the regular copy API perform P2P copies transparently under the hood?
Author: The regular API takes a single pi_queue, whereas the Peer API requires a second queue as an argument (principally so that the second context is known). The regular API is an OpenCL interface, so it cannot be changed. I think that a single API could be used if the new piext*** API was used in the runtime to replace the regular copy API.
Reviewer: The pi_mem src & dst are created with interfaces that take a context, e.g. piextUSMDeviceAlloc or piMemBufferCreate, so backends already know the context of both src and dst and can act on that.
Author: OK, I missed that when I checked out pi_mem. There is another reason for providing both queues, from e.g. this snippet in cuda_piextEnqueueMemBufferCopyRectPeer in pi_cuda.cpp, line 4054: we wait on events associated with the source queue and return the event associated with the dest queue. There were problems associated with returning the event associated with the src queue. Since the contexts can be found from pi_mem, I will look at this again and see if there is a way of doing things without the second queue argument.
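The ordering just described — wait on the source queue's events, return a completion event tied to the destination queue — mirrors the CUDA driver pattern of cuStreamWaitEvent followed by cuEventRecord. Below is a stand-in sketch using mock streams and events so the logic can be shown (and checked) without a GPU; `Stream`, `Event`, and `enqueuePeerCopy` are hypothetical names, not the PR's actual code.

```cpp
#include <string>
#include <vector>

// Mock stand-ins for CUstream / CUevent. Each stream records the
// operations enqueued on it, so the resulting order can be inspected.
struct Event { int id; };
struct Stream {
  std::vector<std::string> ops;
  int nextEvent = 0;
  void waitEvent(Event e) { ops.push_back("wait(" + std::to_string(e.id) + ")"); }
  Event record() {
    ops.push_back("record(" + std::to_string(nextEvent) + ")");
    return {nextEvent++};
  }
  void enqueueCopy() { ops.push_back("copy"); }
};

// Peer-copy ordering as described in the comment: the destination stream
// waits on an event recorded on the source stream, then performs the
// copy, and the event handed back to the caller belongs to the
// destination stream (returning a src-queue event was problematic).
Event enqueuePeerCopy(Stream &src, Stream &dst) {
  Event srcDone = src.record(); // capture prior work on the source queue
  dst.waitEvent(srcDone);       // the copy cannot start before src is done
  dst.enqueueCopy();            // the actual P2P transfer
  return dst.record();          // completion event lives on the dst queue
}
```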
Author: It turns out that it is fine to pass only a single queue in the PI, which acts as the command_queue, and it is fine for either the src_queue or the dst_queue to play that role. My current understanding is that all implementation details of implicit peer-to-peer memory copies for buffer memory between devices sharing a SYCL context should be dealt with by the PI, such that the only implicit peer-to-peer copy case that the runtime (via memory_manager) needs to handle is the cross-context case. I will implement the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager as suggested.
Author: I have now made the changes described above, implementing the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager as suggested.
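The dispatch the thread converged on can be summarized schematically: within a single context (even across devices) the plugin's regular piEnqueueMemBufferCopy is invoked and performs any P2P transfer under the hood, while only the cross-context case takes a separate runtime-managed path. The sketch below illustrates that decision only; `Context`, `Buffer`, and `copyBuffer` are hypothetical names, not the actual SYCL runtime code.

```cpp
#include <string>

// Hypothetical stand-ins for runtime handles.
struct Context { int id; };
struct Buffer { Context ctx; };

// Schematic of the dispatch decided on in this thread: same-context
// copies go straight to the plugin's regular copy entry point, which
// handles P2P transparently; cross-context copies are staged by the
// runtime's memory_manager instead.
std::string copyBuffer(const Buffer &src, const Buffer &dst) {
  if (src.ctx.id == dst.ctx.id)
    return "piEnqueueMemBufferCopy";   // plugin handles P2P under the hood
  return "cross-context runtime copy"; // handled by memory_manager
}
```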