Parallel runtime library design doc (PRIF) #76088
Ping. Just trying to see if anybody has had a chance to look at this or could recommend someone to review it. FYI, if you'd rather read a formatted PDF instead of the Markdown, you can find a version here: https://dx.doi.org/10.25344/S4DG6S
My (partial) review:
Put with notification is nigh impossible to implement efficiently as specified. Notification variables must be in coarrays. Having the only argument in You should implement the put+notify with a separate function that has a second |
Assuming I infer correctly that this function plus the raw operations are designed to optimize for shared-memory communication via load-store, there needs to be a high-level semantic overview of the purpose of these functions and their difference from the non-raw operations (which need to include strided variants, as noted above). If I do not infer correctly the use case, then I disagree with the design. Side note: I have implemented OpenSHMEM on top of MPI with the general case and specialization for the shared-memory path (https://github.com/jeffhammond/oshmpi/blob/master/src/shmem-internals.c). It is going to be important to support both use cases because we want coarrays to be competitive with shared-memory parallelism (e.g. OpenMP) when programs run within a shared-memory domain. |
FYI, if anyone would like to review the new revision of this document in pdf form, it can be found at http://doi.org/10.25344/S4501W |
Everyone: If you have feedback about specific APIs or portions of the document, please use the "Files Changed" pane in the PR to attach comments to the appropriate lines of the document. This will enable us to hold organized, threaded discussions about individual topics, which is not really feasible on this non-threaded Conversation tab. Thanks.
With permission, copied over comments from @jeffhammond to specific parts in the document to establish a place to respond to each comment.
We've just produced a new revision of this document, for those interested in reviewing a nicely formatted version, the following link provides a PDF: The changes from the previous revision are:
I am not a reviewer for this PR, and I usually don't intrude on flang community activity, but I have read the document and have a couple of suggestions to make, mostly from an optimization perspective.

In brief: Target hardware for coarray Fortran includes two important subsets: those targets whose interconnects admit direct load/store access to remote data, and those whose data transfers are driven by controlling an RDMA NIC's MMRs. A user who wants to optimize for a specific target interconnect, or class of interconnects, should be able to stipulate so with a command-line option, and might get better performance than they would from a compilation that supports any possible target interconnect.

When the target interconnect is known at compilation time to be one of those that support load/store access to remote memory, it would be useful to have a runtime library interface to perform the necessary address calculation to compute a remote base address for a given coarray on a particular image, if that image is not known to have failed. This would allow an optimizer to amortize the cost of that calculation when multiple references to the same coarray/image will follow, and would enable loop transformations to prefetch data and hide load latency.

On the other hand, when the target interconnect is known at compilation time to be one that supports asynchronous transactions, it would similarly be useful to have runtime library interfaces to initiate asynchronous reads and await their completions, again for hiding load latency. (Other optimizations that might apply for these targets, such as software caching, can be left to the target's runtime library implementation, I think.)
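To make the amortization idea above concrete, here is a minimal toy sketch in Python. It is not PRIF and not any existing runtime: `ToyRuntime`, `remote_base`, and `get` are hypothetical names, and the "remote" memory is just a list shared by all simulated images. It only counts how many address calculations each access pattern performs.

```python
# Toy model only: none of these names come from PRIF or any real runtime.
class ToyRuntime:
    """Simulates images whose coarray storage lives in one shared address space."""

    def __init__(self, num_images, elems_per_image):
        self.memory = [0.0] * (num_images * elems_per_image)
        self.elems_per_image = elems_per_image
        self.translations = 0  # number of address calculations performed

    def remote_base(self, image):
        """Hypothetical query: offset of the coarray's first element on `image` (1-based)."""
        self.translations += 1
        return (image - 1) * self.elems_per_image

    def get(self, image, index):
        """Per-access path: recomputes the remote address on every call."""
        return self.memory[self.remote_base(image) + index]


def sum_naive(rt, image, n):
    # One address calculation per element read.
    return sum(rt.get(image, i) for i in range(n))


def sum_hoisted(rt, image, n):
    # The base is computed once and reused for every element.
    base = rt.remote_base(image)
    return sum(rt.memory[base + i] for i in range(n))


rt = ToyRuntime(num_images=4, elems_per_image=8)
for i in range(8):
    rt.memory[2 * rt.elems_per_image + i] = float(i)  # fill image 3's block

naive_total = sum_naive(rt, image=3, n=8)
naive_translations = rt.translations

rt.translations = 0
hoisted_total = sum_hoisted(rt, image=3, n=8)
hoisted_translations = rt.translations
```

Both loops read the same values, but the hoisted version performs a single address calculation instead of one per element, which is the saving an optimizer could exploit if a runtime exposed such a query.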
Hi @klausler - I’m part of the PRIF team at Berkeley Lab. Thanks for the great questions! @klausler said:
The actual hardware landscape is more complicated than you seem to imply. In particular, many modern HPC platforms demonstrate a combination of both characteristics you describe, i.e., cache-coherent load-store access between cores/processes running within a single physical memory domain (i.e. "intra-node") AND RDMA access across an explicit interconnect between physical domains (i.e. "inter-node"). In general we care about deployments where both classes of transport may be simultaneously active in a given job execution, and the transport distinction is not (in general) globally static, but instead depends on the physical placement of the images involved in a given communication operation. A purely "single-node" system, where all images happen to use an intra-node transport is just a special case of this more general situation. The overheads involved in initiating communication over an RDMA transport are generally on the order of microseconds (corresponding to thousands of cycles on a modern processor), and usually dwarf details like serial instructions for address arithmetic by several orders of magnitude. Amortizing constant-time "setup" overheads for RDMA communication is unlikely to be a fruitful optimization, and would require exposing non-portable interconnect-specific details to the PRIF client. However a load/store transport through hardware-managed shared memory tends to be orders of magnitude faster (in overhead and latency), and in this case overheads like address translation are expected to have a greater relative impact. Here the cost of redundant address calculations and even the cost of extra procedure calls may become significant relative to the cost of a load/store-based data transfer. As such, we agree that on such transports, there are opportunities for fruitful amortization of communication "setup" overheads. We envision eventually expanding PRIF with calls allowing the client to detect and take advantage of this situation when appropriate. 
This is explicitly documented in Section 7: Future Work. @klausler said:
We agree that on multi-node networks, asynchronous communication is an effective means to hide communication latency by overlapping it with computation or other communication. Our group has a long history of exploiting those types of communication optimizations in the context of other parallel programming models. PRIF currently lacks entry points for explicitly asynchronous communication operations, because we wanted to start with the simplest interface that would allow a complete and compliant implementation of Fortran’s multi-image parallel semantics. We would like to see future revisions of PRIF add extensions for explicitly asynchronous communication (especially for coindexed reads, as you suggest), as documented in Section 7: Future Work. |
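As an illustration of the split-phase pattern discussed here, the following toy Python sketch separates initiating a read from waiting for it. The names `get_begin` and `get_wait` are hypothetical and are not PRIF procedures; a thread pool and a short sleep stand in for the network transfer.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Toy model only: these names are illustrative and are not PRIF procedures.
REMOTE_DATA = {2: [1.0, 2.0, 3.0]}  # pretend this data lives on image 2
_pool = ThreadPoolExecutor(max_workers=1)


def _slow_fetch(image):
    time.sleep(0.05)  # stand-in for network latency
    return list(REMOTE_DATA[image])


def get_begin(image):
    """Initiate a read and return a handle immediately."""
    return _pool.submit(_slow_fetch, image)


def get_wait(handle):
    """Block until the read identified by `handle` has completed."""
    return handle.result()


handle = get_begin(2)                          # start the transfer
local_work = sum(i * i for i in range(1000))   # independent work overlaps with it
fetched = get_wait(handle)                     # the data is needed only from here on
_pool.shutdown()
```

The point of the split is that the independent computation between `get_begin` and `get_wait` runs while the transfer is in flight, so its latency is hidden rather than added to the critical path.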
In the community's compiler's default compilation mode, I would expect that optimizations peculiar to a single transport would not apply. But I think that we agree that there are target interconnects, or families of interconnects, for which specialized optimization could be beneficial, and for which the necessary support in your API could be designed now. Having the hooks in your API that I mentioned above (remote address calculation, split asynchronous transactions) would make it more attractive as a common solution. One memory-mapped interconnect of interest to me is not "single-node", namely NVLink with NVSwitch as a multi-node GPU fabric (https://www.nvidia.com/en-us/data-center/nvlink/). If the compiler and runtime can support optimized compilation for this fabric, then Fortran's coarrays may become a viable parallel & accelerated programming model for such systems. |
This has not been addressed. |
Can you remind me why collective buffer arguments are |
We've just produced a new revision of this document, for those interested in reviewing a nicely formatted version, the following link provides a PDF: The changes from the previous revision are:
LGTM
Status Report: This design document PR has gone through three major revisions in response to reviewer comments in the approximately 13 months since the PR was submitted. A partial implementation exists and is making progress in Caffeine, which has gone through three corresponding release cycles over the same period. We have received approval from two reviewers external to our organization. Unless we receive additional reviews here, we plan to merge this PR on or after January 22.
Does the latest revision support APIs for the two features that I mentioned above (remote address calculation and split asynchronous transactions)? If so, great, ship it. |
This document records the runtime interface specification that will be used for supporting the multi-image parallel features of LLVM flang. Co-authored-by: Katherine Rasmussen <[email protected]> Co-authored-by: Brad Richardson <[email protected]> Co-authored-by: Damian Rouson <[email protected]>
Thank you for the feedback, @kiranchandramohan and @kbeyls. We have updated this PR to address your concerns and are ready for your review.
We have updated the PR and we believe the license concern is now resolved.
The intended scope of PRIF is the multi-image parallel features of Fortran, as outlined by the exhaustive list in section 2. This deliberately omits single-image parallel features such as
Yes, the features that @klausler asked for are already explicitly addressed in our future work section 7. We agree with the long-term goal of eventually having these features available. We believe it will be possible to later augment the PRIF spec with procedure interfaces that add these features without breaking backwards compatibility with earlier PRIF revisions.
We don’t intend for this Markdown file to be the canonical definition of PRIF. We want PRIF to be a compilation target for several compilers, and to carry its current Creative Commons CC BY-ND license. For both of these reasons, we now believe we should replace the contents of this PR with a brief document that cites a DOI for an externally hosted open-access copy of the latest PRIF specification. We have not yet settled on a detailed formal process for updating PRIF in future revisions, but we intend it to be an inclusive, open process involving representation from all relevant stakeholders.
We propose PRIF as the sole mechanism for supporting flang’s multi-image features, while allowing and supporting multiple possible library implementations of PRIF.
No. As an example of potential alternatives, @jeffhammond has communicated an interest in writing an MPI implementation of PRIF and possibly an OpenSHMEM implementation also. |
Thanks, @ktras, for updating and resolving the license issue and the detailed reply.
It will be good to specify this as early as possible to ensure that everyone is OK with the process and anyone interested can participate.
Since this is a design decision that is being taken, it is good to add a few more reviewers from organisations participating in Flang. |
Hi, we've been iterating for several months with the Berkeley Lab team to get a working, satisfactory prototype in LLVM that plugs into PRIF. I've started to push some commits here: https://github.com/SiPearl/llvm-project/tree/prif. This is still WIP, and the goal is to gather feedback, so feel free to comment. These commits include first versions for:
To test it, simply build flang and a PRIF implementation (Caffeine) and run |
# Multi-Image Parallel Fortran Runtime

LLVM Flang targets the Parallel Runtime Interface for Fortran (PRIF) for
It's premature to say that LLVM Flang targets PRIF. I'm looking forward to a day when all of the hard-work on PRIF is rewarded with an upstream implementation! Until then, how about rewording this to be (roughly):
PRIF defines an interface designed for LLVM Flang to target implementations of Fortran's multi-image parallel features
Thank you for your comment @sscalpone. As you suggested, we have updated the text to address your feedback.
Very well done. Thank you!
Co-authored-by: Katherine Rasmussen <[email protected]> Co-authored-by: Damian Rouson <[email protected]>
LGTM
Feedback received from @kiranchandramohan expressed a desire for a clear process on how PRIF will be updated in the future and how LLVM Flang developers can be involved. In response, we have formed the PRIF Specification Committee, with a governance document detailing the process by which changes to PRIF are proposed and incorporated into the specification. The Committee meets in monthly virtual meetings and anyone is welcome to join. We have an open mailing list here, [email protected], which anyone can join or view. We believe this resolves the stated reviewer feedback. |
LG. The governance model that allows others to participate and modify PRIF addresses my concerns.
Please reach out to others who had comments to check whether they are OK before submitting.
Shouldn't PRIF just become part of the LLVM project if it's a critical dependency? Wouldn't that address the governance issues automatically? |
@jeffhammond thanks for the suggestion. The goal is for PRIF to support multiple compilers, including ones that don't have LLVM as a dependency. The governance document was drafted in response to the 23 January 2025 comment by @kiranchandramohan, who acknowledges above that the developed governance model addresses the stated concern. |
PRIF Specification Revisions 0.3, 0.4, and 0.5 address comments received since the December 2023 creation of this PR and the publication of PRIF 0.2. We believe all the outstanding review comments have been addressed. Additionally, the Caffeine library implementation of PRIF has progressed to support nearly all of PRIF 0.5 (https://go.lbl.gov/caffeine-status). We have received approval from three reviewers external to our organization. We intend to merge this PR on or after July 16th, unless someone explains here a way in which we have not adequately addressed the reviews to date. |
Then why does it make sense to make this specification part of LLVM documentation? Would it not make more sense as part of its implementation repository, or maybe its own thing altogether? |
We agree that it makes the most sense for PRIF “to be its own thing altogether”. PRIF is now an open specification governed by a committee of stakeholders. This PR just adds a cross-reference to that specification, which will be used in future PRs to add runtime support for multi-image features.
This document specifies the interface design for supporting the parallel features of flang.
For those who are interested in reviewing the document and would like a more nicely formatted copy, the following is a link to a PDF version:
https://doi.org/10.25344/S4CG6G