
Parallel runtime library design doc (PRIF) #76088

Merged
merged 2 commits into from
Jul 16, 2025

everythingfunctional
Contributor

@everythingfunctional everythingfunctional commented Dec 20, 2023

This document specifies the interface design for supporting the parallel features of flang.

For those interested in reviewing the document who would like a more nicely formatted copy, a PDF version is available at the following link:
https://doi.org/10.25344/S4CG6G

@everythingfunctional everythingfunctional force-pushed the parallel-runtime-library-design-doc branch from 1c0e485 to e6bfa3f Compare December 22, 2023 14:19
@llvmbot llvmbot added the flang Flang issues not falling into any other category label Dec 22, 2023
@everythingfunctional everythingfunctional force-pushed the parallel-runtime-library-design-doc branch from e6bfa3f to 2bace2e Compare December 22, 2023 14:21
@everythingfunctional
Contributor Author

ping. Just trying to see if anybody has had a chance to look at this or could recommend someone to review it. FYI, if you'd rather read a formatted PDF instead of the Markdown, you can find a version here: https://dx.doi.org/10.25344/S4DG6S

@everythingfunctional everythingfunctional changed the title Parallel runtime library design doc Parallel runtime library design doc (PRIF) Apr 5, 2024
@jeffhammond
Member

My (partial) review:

  • All communication operations are locally blocking. There is no way for the compiler to defer or aggregate synchronization.
  • prif_put says it blocks on completion. prif_put_raw doesn't say it blocks. I infer that it does, but every function needs to be clear. (same for get)
  • prif_{get,put}_strided need to exist to allow for an efficient implementation over MPI. The MPI_Win argument is going to be in type(prif_coarray_handle) and therefore that argument needs to be present for all communication that is going to lead to MPI RMA.
  • Please do not use c_intmax_t. Just use c_size_t for bounds, etc.
  • final_func takes the coarray handle as intent(in). Is there no use case for final_func modifying the state inside of the coarray handle?
  • prif_base_pointer should be immediately prior to the raw operations since the usage will be paired. I know it's a query but locality of reference when reading the document is important.
  • For example, this document references the term coindexed-named-object multiple times, but does not define it since it is part of the language and the Fortran standard defines it. As such, in order to fully understand the PRIF specification, it is critical to read and reference the Fortran standard alongside it. I appreciate the idea here but this is going to be hostile to potential implementers of PRIF who are not Fortran language lawyers. The Fortran standard is not easy to read and some of the people who implement PRIF may be non-Fortran programmers. Requiring Fortran language expertise much beyond CFI will impede implementation efforts.
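To illustrate the strided transfers requested in the prif_{get,put}_strided item above, here is a minimal one-dimensional sketch. It uses a plain Python bytearray as a stand-in for another image's memory. The name `put_strided` and its parameter list are hypothetical and are not the PRIF signature; offsets and strides are assumed to be in bytes.

```python
def put_strided(remote, remote_offset, remote_stride,
                local, local_offset, local_stride,
                element_size, count):
    """Copy `count` elements of `element_size` bytes from `local` into `remote`.

    Hypothetical helper, not a PRIF procedure. Offsets and strides are in
    bytes. A stride equal to `element_size` degenerates to a contiguous copy.
    """
    for i in range(count):
        src = local_offset + i * local_stride
        dst = remote_offset + i * remote_stride
        remote[dst:dst + element_size] = local[src:src + element_size]


# Local buffer: four 2-byte elements packed contiguously.
local = bytearray(b"AABBCCDD")
# "Remote" memory: 16 zero bytes standing in for a coarray on another image.
remote = bytearray(16)

# Scatter the contiguous local elements into every other 2-byte remote slot.
put_strided(remote, 0, 4, local, 0, 2, element_size=2, count=4)
```

Without a strided entry point, a client would have to issue one contiguous put per element, which is the overhead the review comment is concerned about for an MPI-based implementation.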

@jeffhammond
Member

Put with notification is nigh impossible to implement efficiently as specified. Notification variables must be in coarrays. Having the only argument in prif_put(_raw) be notify_ptr requires an MPI-based implementation (or any other that can't do remote atomic writes to virtual memory directly) to do an expensive lookup on the critical path of what is supposed to be a high-performance operation.

You should implement the put+notify with a separate function that has a second type(prif_coarray_handle) argument for the notification variable.
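The cost being described can be sketched as follows. This is only a model under stated assumptions: `CoarrayRegistry` is a hypothetical stand-in for the bookkeeping an MPI-based runtime might keep (one window per coarray allocation), and no real MPI or PRIF calls are used. When only a raw pointer is supplied, the runtime has to search for the allocation that contains it. When a handle and an offset are supplied, no search is needed.

```python
import bisect


class CoarrayRegistry:
    """Hypothetical map from address ranges to coarray allocations."""

    def __init__(self):
        self._bases = []   # sorted base addresses of registered coarrays
        self._sizes = {}   # base address -> allocation size in bytes

    def register(self, base, size):
        """Record an allocation and return its handle (here simply the base)."""
        bisect.insort(self._bases, base)
        self._sizes[base] = size
        return base

    def resolve_pointer(self, addr):
        """Pointer-only path: search for the allocation containing `addr`."""
        i = bisect.bisect_right(self._bases, addr) - 1
        if i < 0 or addr >= self._bases[i] + self._sizes[self._bases[i]]:
            raise KeyError(f"address {addr:#x} is not inside any registered coarray")
        base = self._bases[i]
        return base, addr - base


registry = CoarrayRegistry()
data_handle = registry.register(0x1000, 0x100)
notify_handle = registry.register(0x4000, 0x40)

# Pointer-only interface: a lookup on the critical path of every put.
looked_up = registry.resolve_pointer(0x4008)

# Handle-plus-offset interface: the same target is named with no lookup.
direct = (notify_handle, 8)
```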

@jeffhammond
Member

prif_base_pointer does not describe the failure mode when memory is not accessible via load-store at the specified image. This function seems equivalent to shmem_ptr, which is documented to return NULL when remote memory is not accessible via pointers.

Assuming I infer correctly that this function plus the raw operations are designed to optimize for shared-memory communication via load-store, there needs to be a high-level semantic overview of the purpose of these functions and their difference from the non-raw operations (which need to include strided variants, as noted above).

If I do not infer correctly the use case, then I disagree with the design.

Side note: I have implemented OpenSHMEM on top of MPI with the general case and specialization for the shared-memory path (https://github.com/jeffhammond/oshmpi/blob/master/src/shmem-internals.c). It is going to be important to support both use cases because we want coarrays to be competitive with shared-memory parallelism (e.g. OpenMP) when programs run within a shared-memory domain.
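The shmem_ptr-style contract discussed in this comment, together with the two code paths it implies, can be sketched as below. All names (`Image`, `base_pointer`, `put`) are hypothetical. "Same node" is modelled by an integer node id, and the fallback branch merely stands in for a real RMA transfer.

```python
class Image:
    """Hypothetical image: a node id plus a block of coarray memory."""

    def __init__(self, node, size):
        self.node = node
        self.memory = bytearray(size)


def base_pointer(me, target):
    """Return a writable view of target's memory, or None when the target
    is not reachable by load/store (modelled here as a different node)."""
    return memoryview(target.memory) if me.node == target.node else None


def put(me, target, offset, data):
    """Use the load/store path when available, otherwise fall back."""
    view = base_pointer(me, target)
    if view is not None:
        view[offset:offset + len(data)] = data   # direct store
        return "load-store"
    # Stand-in for an RMA put over the interconnect.
    target.memory[offset:offset + len(data)] = data
    return "rma"


a = Image(node=0, size=8)
b = Image(node=0, size=8)   # shares a node with a
c = Image(node=1, size=8)   # on a different node

path_ab = put(a, b, 2, b"hi")
path_ac = put(a, c, 0, b"yo")
```

The point of documenting the None (NULL) result is that callers must always be prepared to take the second branch.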

@everythingfunctional
Contributor Author

FYI, if anyone would like to review the new revision of this document in pdf form, it can be found at http://doi.org/10.25344/S4501W

@bonachea
Contributor

Everyone:

If you have feedback about specific APIs or portions of the document, please use the "Files Changed" Pane in the PR to attach comments on appropriate lines of the document.

This will enable us to hold organized/threaded discussions about individual topics, which is not really feasible on this non-threaded Conversation tab.

Thanks.

Contributor

@ktras ktras left a comment

With permission, copied over comments from @jeffhammond to specific parts in the document to establish a place to respond to each comment.

@everythingfunctional everythingfunctional force-pushed the parallel-runtime-library-design-doc branch from 51505e2 to 26e1ed5 Compare July 17, 2024 17:24
@everythingfunctional
Contributor Author

We've just produced a new revision of this document. For those interested in reviewing a nicely formatted version, the following link provides a PDF:
http://doi.org/10.25344/S4WG64

The changes from the previous revision are:

  • Changes to Coarray Access (puts and gets):

    • Refactor to provide separate procedure interfaces for the various combinations of:
      direct vs indirect target location, puts with or without a notify-variable,
      direct vs indirect notify-variable location, and strided vs contiguous data access.
    • Add discussion of direct and indirect location accesses to
      the Design Decisions and Impact section
    • Rename _raw_ procedures to _indirect_
    • Replace cosubscripts, team, and team_number arguments with image_num
    • Replace first_element_addr arguments with offset
    • Replace type(*) value arguments with size and current_image_buffer
    • Rename remote_ptr_stride arguments to remote_stride
    • Rename current_image_buffer_stride arguments to current_image_stride
    • Rename size arguments to size_in_bytes
  • Other changes to PRIF procedure interfaces:

    • Establish a new uniform argument ordering across all non-collective
      communication procedures
    • Remove prif_base_pointer. Direct access procedures should be used instead.
    • Add direct versions of prif_event_post, prif_lock, and
      prif_unlock and rename previous versions to ..._indirect
    • Convert prif_num_images into three different procedures with no
      optional arguments, in order to more closely align with the
      Fortran standard. Do the same with prif_image_index.
    • Correct the kind for atomic procedures from atomic_int_kind to PRIF_ATOMIC_INT_KIND
      and from atomic_logical_kind to PRIF_ATOMIC_LOGICAL_KIND
    • Remove target attribute from coarray_handles argument in prif_deallocate_coarray
    • Rename element_length argument in prif_allocate_coarray to element_size
    • Rename image_index argument in prif_this_image_no_coarray to this_image
    • Remove generic interfaces throughout
  • Miscellaneous new features:

    • Allow multiple calls to prif_init from each process, and add
      PRIF_STAT_ALREADY_INIT constant
    • Add new PRIF-specific constants PRIF_VERSION_MAJOR and PRIF_VERSION_MINOR
  • Narrative and editorial improvements:

    • Add/improve Common Arguments subsections and add links to them
      below procedure interfaces
    • Elide argument lists for all procedures and add prose explaining
      how the PRIF specification presents the procedure interfaces
    • Add client notes to subsections introducing PRIF Types, and permute subsection order
    • Add guidance to clients regarding coarray dummy arguments
    • Remove grammar non-terminals, including coindexed-named-object
    • Add several terms to the glossary
    • Numerous minor wording changes throughout
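One consequence of replacing the cosubscripts, team, and team_number arguments with image_num is that the mapping from cosubscripts to an image number now happens before the communication call. A minimal sketch of that mapping follows, assuming the usual column-major cosubscript ordering of Fortran and ignoring teams and range checks. The function name is hypothetical; a real client would presumably obtain the value from the runtime (the change list mentions prif_image_index).

```python
def image_num_from_cosubscripts(cosubscripts, lcobounds, ucobounds):
    """Map cosubscripts to a 1-based image number, column-major.

    Hypothetical helper. `ucobounds` has one entry fewer than `lcobounds`
    because the final upper cobound of a coarray is always `*`.
    """
    if len(cosubscripts) != len(lcobounds) or len(ucobounds) != len(lcobounds) - 1:
        raise ValueError("inconsistent corank")
    image_num = 1
    multiplier = 1
    for k, (sub, lo) in enumerate(zip(cosubscripts, lcobounds)):
        image_num += (sub - lo) * multiplier
        if k < len(ucobounds):
            multiplier *= ucobounds[k] - lo + 1   # coextent of codimension k
    return image_num


# For a coarray declared with codimensions [2,*]:
first = image_num_from_cosubscripts([1, 1], [1, 1], [2])
sixth = image_num_from_cosubscripts([2, 3], [1, 1], [2])
```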

@everythingfunctional everythingfunctional force-pushed the parallel-runtime-library-design-doc branch from 9fa1eab to 265f177 Compare August 22, 2024 14:19
@klausler
Contributor

klausler commented Oct 2, 2024

I am not a reviewer for this PR, and I usually don't intrude on flang community activity, but I have read the document and have a couple of suggestions to make, mostly from an optimization perspective.

In brief: Target hardware for coarray Fortran includes two important subsets: those targets whose interconnects admit direct load/store access to remote data, and those whose data transfers are driven by controlling an RDMA NIC's MMRs. A user who wants to optimize for a specific target interconnect, or class of interconnects, should be able to stipulate so with a command-line option, and might get better performance than they would from a compilation that supports any possible target interconnect.

When the target interconnect is known at compilation time to be one of those that support load/store access to remote memory, it would be useful to have a runtime library interface to perform the necessary address calculation to compute a remote base address for a given coarray on a particular image, if that image is not known to have failed. This would allow an optimizer to amortize the cost of that calculation when multiple references to the same coarray/image will follow, and would enable loop transformations to prefetch data and hide load latency.

On the other hand, when the target interconnect is known at compilation time to be one that supports asynchronous transactions, it would similarly be useful to have runtime library interfaces to initiate asynchronous reads and await their completions, again for hiding load latency. (Other optimizations that might apply for these targets, such as software caching, can be left to the target's runtime library implementation, I think.)

@bonachea
Contributor

Hi @klausler - I’m part of the PRIF team at Berkeley Lab. Thanks for the great questions!

@klausler said:

Target hardware for coarray Fortran includes two important subsets: those targets whose interconnects admit direct load/store access to remote data, and those whose data transfers are driven by controlling an RDMA NIC's MMRs. [...] it would be useful to have a runtime library interface to perform the necessary address calculation to compute a remote base address for a given coarray on a particular image, [...] allow an optimizer to amortize the cost of that calculation when multiple references to the same coarray/image will follow

The actual hardware landscape is more complicated than you seem to imply. In particular, many modern HPC platforms demonstrate a combination of both characteristics you describe, i.e., cache-coherent load-store access between cores/processes running within a single physical memory domain (i.e. "intra-node") AND RDMA access across an explicit interconnect between physical domains (i.e. "inter-node"). In general we care about deployments where both classes of transport may be simultaneously active in a given job execution, and the transport distinction is not (in general) globally static, but instead depends on the physical placement of the images involved in a given communication operation. A purely "single-node" system, where all images happen to use an intra-node transport is just a special case of this more general situation.

The overheads involved in initiating communication over an RDMA transport are generally on the order of microseconds (corresponding to thousands of cycles on a modern processor), and usually dwarf details like serial instructions for address arithmetic by several orders of magnitude. Amortizing constant-time "setup" overheads for RDMA communication is unlikely to be a fruitful optimization, and would require exposing non-portable interconnect-specific details to the PRIF client.

However, a load/store transport through hardware-managed shared memory tends to be orders of magnitude faster (in overhead and latency), and in this case overheads like address translation are expected to have a greater relative impact. Here the cost of redundant address calculations and even the cost of extra procedure calls may become significant relative to the cost of a load/store-based data transfer. As such, we agree that on such transports, there are opportunities for fruitful amortization of communication "setup" overheads. We envision eventually expanding PRIF with calls allowing the client to detect and take advantage of this situation when appropriate. This is explicitly documented in Section 7: Future Work.
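A back-of-the-envelope version of the argument in the two paragraphs above is sketched here. Every number is an assumption chosen for illustration only (a 3 GHz core, 1 microsecond of RDMA initiation overhead, 30 ns for a load/store transfer, 10 cycles of address arithmetic); none of them comes from the PRIF document or from measurements.

```python
# All constants are illustrative assumptions, not measured values.
CYCLES_PER_NS = 3            # assumed 3 GHz core
RDMA_OVERHEAD_NS = 1000      # assumed 1 microsecond to initiate an RDMA transfer
LOAD_STORE_NS = 30           # assumed cost of a shared-memory load/store transfer
ADDRESS_ARITH_CYCLES = 10    # assumed cost of recomputing the remote address


def setup_share(transfer_cycles, setup_cycles):
    """Fraction of the total time spent on the setup work."""
    return setup_cycles / (transfer_cycles + setup_cycles)


rdma_cycles = RDMA_OVERHEAD_NS * CYCLES_PER_NS          # thousands of cycles
load_store_cycles = LOAD_STORE_NS * CYCLES_PER_NS

rdma_share = setup_share(rdma_cycles, ADDRESS_ARITH_CYCLES)
load_store_share = setup_share(load_store_cycles, ADDRESS_ARITH_CYCLES)
```

Under these assumptions the address arithmetic is well under 1% of an RDMA operation but about 10% of a load/store transfer, which is why hoisting it only looks worthwhile on the shared-memory path.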

@klausler said:

when the target interconnect is known at compilation time to be one that supports asynchronous transactions, it would similarly be useful to have runtime library interfaces to initiate asynchronous reads and await their completions, again for hiding load latency.

We agree that on multi-node networks, asynchronous communication is an effective means to hide communication latency by overlapping it with computation or other communication. Our group has a long history of exploiting those types of communication optimizations in the context of other parallel programming models. PRIF currently lacks entry points for explicitly asynchronous communication operations, because we wanted to start with the simplest interface that would allow a complete and compliant implementation of Fortran’s multi-image parallel semantics. We would like to see future revisions of PRIF add extensions for explicitly asynchronous communication (especially for coindexed reads, as you suggest), as documented in Section 7: Future Work.
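The split-phase pattern under discussion (initiate a read, overlap other work, then wait for completion) can be sketched as below. This is a model only: PRIF as described in this thread has no such entry points, `blocking_get` is a hypothetical stand-in for a blocking coindexed read, and a thread pool plus a short sleep simulate the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "remote" data, keyed by image number.
remote_memory = {2: [10, 20, 30]}


def blocking_get(image_num, index):
    """Stand-in for a blocking remote read with some network latency."""
    time.sleep(0.01)
    return remote_memory[image_num][index]


with ThreadPoolExecutor(max_workers=1) as pool:
    handle = pool.submit(blocking_get, 2, 1)   # initiate the read
    local_work = sum(range(1000))              # overlap with local computation
    value = handle.result()                    # await completion
```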

@klausler
Contributor

A purely "single-node" system, where all images happen to use an intra-node transport is just a special case of this more general situation.

In the community's compiler's default compilation mode, I would expect that optimizations peculiar to a single transport would not apply. But I think that we agree that there are target interconnects, or families of interconnects, for which specialized optimization could be beneficial, and for which the necessary support in your API could be designed now. Having the hooks in your API that I mentioned above (remote address calculation, split asynchronous transactions) would make it more attractive as a common solution.

One memory-mapped interconnect of interest to me is not "single-node", namely NVLink with NVSwitch as a multi-node GPU fabric (https://www.nvidia.com/en-us/data-center/nvlink/). If the compiler and runtime can support optimized compilation for this fabric, then Fortran's coarrays may become a viable parallel & accelerated programming model for such systems.

@jeffhammond
Member

  • Please do not use c_intmax_t. Just use c_size_t for bounds, etc.

This has not been addressed. intmax_t is bad and should not be used. https://thephd.dev/intmax_t-hell-c++-c has some info and @erichkeane has some strong words about it elsewhere on the internet.

@jeffhammond
Member

Can you remind me why collective buffer arguments are contiguous, target? Is there any assumption that the compiler copies non-contiguous subarrays here? What is the target attribute doing? Thanks.

@everythingfunctional
Contributor Author

We've just produced a new revision of this document. For those interested in reviewing a nicely formatted version, the following link provides a PDF:
https://doi.org/10.25344/S4CG6G

The changes from the previous revision are:

  • Convert all instances of c_intmax_t to c_int64_t
  • Replace lbounds, ubounds, element_size arguments in prif_allocate_coarray with size_in_bytes
  • Specify definition of prif_coarray_handle to be a derived type with one member that is a pointer
  • Add prif_register_stop_callback
  • Remove contiguous attribute from argument a in collective subroutines
  • Update argument list to prif_co_reduce to use operation_wrapper and add cdata argument
  • Add prif_co_max_character and prif_co_min_character
  • Constrain argument type for argument a to prif_co_max, prif_co_min, and prif_co_sum
  • Add client note indicating that prif_co_reduce may be used to support other collective calls where a is not an interoperable type
  • Clarify the semantics of derived types passed to prif_co_broadcast
  • Add prif_local_data_pointer
  • Prohibit overlap between source and destination memory regions in coarray access
  • Numerous minor editorial changes throughout
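With lbounds, ubounds, and element_size replaced by a single size_in_bytes argument in prif_allocate_coarray, the size computation moves to the caller. A minimal sketch of that computation follows. The helper name is hypothetical, and it assumes the usual Fortran rule that a dimension whose upper bound is below its lower bound has extent zero.

```python
from math import prod


def size_in_bytes(lbounds, ubounds, element_size):
    """Total local size of an array given its bounds and element size.

    Hypothetical helper, not a PRIF procedure. Empty bound lists describe
    a scalar, for which the size is just `element_size`.
    """
    extents = [max(0, ub - lb + 1) for lb, ub in zip(lbounds, ubounds)]
    return prod(extents) * element_size


# An 8-byte-element array with bounds (1:10, 0:4) has 10 * 5 elements.
array_bytes = size_in_bytes([1, 0], [10, 4], 8)
scalar_bytes = size_in_bytes([], [], 4)
```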

@everythingfunctional everythingfunctional force-pushed the parallel-runtime-library-design-doc branch from f6a947d to 0473bd8 Compare December 24, 2024 15:05

@etienne-renault etienne-renault left a comment

LGTM

@rouson
Contributor

rouson commented Jan 15, 2025

Status Report

This design document PR has gone through three major revisions in response to reviewer comments in the approximately 13 months since the PR was submitted. A partial implementation exists and is making progress in Caffeine, which has gone through three corresponding release cycles over the same period. We have received approval from two reviewers external to our organization.

Unless we receive additional reviews here, we plan to merge this PR on or after January 22.

@klausler
Contributor

Unless we receive additional reviews here, we plan to merge this PR on or after January 22.

Does the latest revision support APIs for the two features that I mentioned above (remote address calculation and split asynchronous transactions)? If so, great, ship it.

This document records the runtime interface specification that
will be used for supporting the multi-image parallel features of LLVM flang.

Co-authored-by: Katherine Rasmussen <[email protected]>
Co-authored-by: Brad Richardson <[email protected]>
Co-authored-by: Damian Rouson <[email protected]>
@bonachea bonachea force-pushed the parallel-runtime-library-design-doc branch from 0473bd8 to 1bc8387 Compare January 22, 2025 21:51
@ktras
Contributor

ktras commented Jan 22, 2025

Thank you for the feedback, @kiranchandramohan and @kbeyls. We have updated this PR to address your concerns and are ready for your review.

Would be good to get the additional license/copyright specification sanitized by the llvm board members. @kbeyls @tlattner

We have updated the PR and we believe the license concern is now resolved.

This document specifies the interface design for supporting the parallel features of flang.
This could be mistaken to mean that all parallel features like OpenMP, OpenACC, Do concurrent should use this interface. Is there a better way to specify this? Like support for coarrays or images in Fortran?

The intended scope of PRIF is the multi-image parallel features of Fortran, as outlined by the exhaustive list in section 2. This deliberately omits single-image parallel features such as do concurrent and OpenMP/OpenACC integration. We will clarify this via prose updates in a future revision of the document, but feel this alone does not justify a new revision. We have also changed the name of the new version of the markdown file in this PR to use the term "multi-image" to indicate the intended scope.

It will be great if you could clarify:

  1. whether the concerns that Peter raised could be handled by augmenting or modifying the PRIF spec.

Yes, the features that @klausler asked for are already explicitly addressed in our future work section 7. We agree with the long-term goal of eventually having these features available. We believe it will be possible to later augment the PRIF spec with procedure interfaces that add these features without breaking backwards compatibility with earlier PRIF revisions.

  2. What is the process for changes to PRIF? Is it OK to modify by making changes to this document?

We don’t intend for this Markdown file to be the canonical definition of PRIF. We want PRIF to be a compilation target for several compilers, and to carry its current Creative Commons CC BY-ND license. For both of these reasons, we now believe we should replace the contents of this PR with a brief document that cites a DOI for an externally hosted open-access copy of the latest PRIF specification. We have not yet settled on a detailed formal process for updating PRIF in future revisions, but we intend it to be an inclusive, open process involving representation from all relevant stakeholders.

  3. Is this intended as one of the mechanisms for supporting coarrays in Flang or as the only mechanism?

We propose PRIF as the sole mechanism for supporting flang’s multi-image features, while allowing and supporting multiple possible library implementations of PRIF.

  4. Is there an implication that caffeine is the only library that should be used with Flang for co-arrays?

No. As an example of potential alternatives, @jeffhammond has communicated an interest in writing an MPI implementation of PRIF and possibly an OpenSHMEM implementation also.

@kiranchandramohan
Contributor

Thanks, @ktras, for updating and resolving the license issue and the detailed reply.

We have not yet settled on a detailed formal process for updating PRIF in future revisions, but we intend it to be an inclusive, open process involving representation from all relevant stakeholders.

It will be good to specify this as early as possible to ensure that everyone is OK with the process and anyone interested can participate.

We propose PRIF as the sole mechanism for supporting flang’s multi-image features, while allowing and supporting multiple possible library implementations of PRIF.

Since this is a design decision that is being taken, it is good to add a few more reviewers from organisations participating in Flang.

@JDPailleux
Contributor

Hi,

We've been iterating for several months with the Berkeley Lab team in order to have a working / satisfying prototype in LLVM to plug in PRIF. I've started to push some commits here: https://github.com/SiPearl/llvm-project/tree/prif. This is still WIP, and the goal is to gather feedback, so feel free to comment.

These commits include first versions for:

  • INIT and STOP (only for the global finalization, not the intrinsic)
  • THIS_IMAGE
  • NUM_IMAGES
  • Collectives: CO_SUM, CO_MIN, CO_MIN_CHARACTER, CO_MAX, CO_MAX_CHARACTER, CO_BROADCAST
  • LCOBOUNDS, UCOBOUNDS and COSHAPE
  • IMAGE_STATUS and FAIL_IMAGE
  • Allocation
  • FORM_TEAM, END TEAM, CHANGE TEAM, GET_TEAM, TEAM_NUMBER.
    A start has been made on TEAM features and variables based on the TEAM_TYPE used by Flang. However, checks need to be carried out to see whether the lowering of arguments/variables of this type can be made compatible between the LLVM type and the type defined by PRIF.

To test it, simply build flang and a PRIF implementation (Caffeine) and run flang -lcaffeine test.f90.


# Multi-Image Parallel Fortran Runtime

LLVM Flang targets the Parallel Runtime Interface for Fortran (PRIF) for

It's premature to say that LLVM Flang targets PRIF. I'm looking forward to a day when all of the hard work on PRIF is rewarded with an upstream implementation! Until then, how about rewording this to be (roughly):
PRIF defines an interface designed for LLVM Flang to target implementations of Fortran's multi-image parallel features

Thank you for your comment @sscalpone. As you suggested, we have updated the text to address your feedback.

Very well done. Thank you!

Co-authored-by: Katherine Rasmussen <[email protected]>
Co-authored-by: Damian Rouson <[email protected]>
@ktras ktras requested a review from sscalpone January 29, 2025 18:36
@ktras ktras requested a review from jeanPerier April 2, 2025 18:04

@tmjbios tmjbios left a comment

LGTM

@ktras ktras requested a review from tmjbios April 25, 2025 22:22
@ktras
Contributor

ktras commented Jul 2, 2025

Feedback received from @kiranchandramohan expressed a desire for a clear process on how PRIF will be updated in the future and how LLVM Flang developers can be involved.

In response, we have formed the PRIF Specification Committee, with a governance document detailing the process by which changes to PRIF are proposed and incorporated into the specification. The Committee meets in monthly virtual meetings and anyone is welcome to join. We have an open mailing list here, [email protected], which anyone can join or view. We believe this resolves the stated reviewer feedback.

Contributor

@kiranchandramohan kiranchandramohan left a comment

LG. The governance model that allows others to participate and modify PRIF addresses my concerns.

Please reach out to others who had comments to check whether they are OK before submitting.

@jeffhammond
Member

Shouldn't PRIF just become part of the LLVM project if it's a critical dependency? Wouldn't that address the governance issues automatically?

@rouson
Contributor

rouson commented Jul 7, 2025

@jeffhammond thanks for the suggestion. The goal is for PRIF to support multiple compilers, including ones that don't have LLVM as a dependency. The governance document was drafted in response to the 23 January 2025 comment by @kiranchandramohan, who acknowledges above that the developed governance model addresses the stated concern.

@ktras
Contributor

ktras commented Jul 7, 2025

PRIF Specification Revisions 0.3, 0.4, and 0.5 address comments received since the December 2023 creation of this PR and the publication of PRIF 0.2. We believe all the outstanding review comments have been addressed. Additionally, the Caffeine library implementation of PRIF has progressed to support nearly all of PRIF 0.5 (https://go.lbl.gov/caffeine-status).

We have received approval from three reviewers external to our organization. We intend to merge this PR on or after July 16th, unless someone explains here a way in which we have not adequately addressed the reviews to date.

@klausler
Contributor

klausler commented Jul 7, 2025

The goal is for PRIF to support multiple compilers, including ones that don't have LLVM as a dependency.

Then why does it make sense to make this specification part of LLVM documentation? Would it not make more sense as part of its implementation repository, or maybe its own thing altogether?

@bonachea
Contributor

bonachea commented Jul 7, 2025

Would it not make more sense as part of its implementation repository, or maybe its own thing altogether?

We agree that it makes the most sense for PRIF “to be its own thing altogether”. PRIF is now an open specification governed by a committee of stakeholders. This PR just adds a cross-reference to that specification, which will be used in future PRs to add runtime support for multi-image features.

@ktras ktras merged commit 8d21025 into llvm:main Jul 16, 2025
7 checks passed