-
Notifications
You must be signed in to change notification settings - Fork 13.9k
for a more even partitioning inline before merge #65281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
(rust_highfive has picked a reviewer for you, use r? to override) |
@bors try @rust-timer queue |
⌛ Trying commit 6396e58bc8d6fdf1646161ea28e01445343e3a74 with merge 977495b636d186bcf98d6f7360b5d74c4d1bf5d6... |
☀️ Try build successful - checks-azure |
@rust-timer build 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 |
Queued 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 with parent 58b5491, future comparison URL. |
Thanks for the PR, @andjo403! This might be a great find! Generally, the changes look good to me. I want to think a bit more if there was a reason not to do things in this order in the first place and, if yes, if that reason still applies today. Also, it looks like some of the code here does some unnecessary work around size estimation and mono-item placement (even before this PR). That should be fixed as part of this PR. I'll take another (closer) look on Monday. |
Finished benchmarking try commit 977495b636d186bcf98d6f7360b5d74c4d1bf5d6, comparison URL. |
Perf result are underwhelming, is is possible that the Perf computer is not powerful enough for this to help? |
IIRC perf is running on 4C/8T CPU and OP has 32 threads. That can make big difference. |
As there is 16CGUs by default at least 16 threads is probably needed for the large time win |
Also the wall times is showing the result better |
Yes, perf may not be sufficiently parallel to see improvements here. However, the lack of regressions seems good - if we are reliably seeing good results locally, I think landing probably makes sense from a performance perspective? |
Hm, so one interesting thing here is that adding size estimates is not quite right for predicting the size of the merged CGU, because with inlining there can be duplicates between the mono-item. For example, say we have
Let's say (for simplicity's sake) that all functions have a size of
I'm not sure if it would make a difference if estimations of merged CGU would take that into account. |
@nnethercote, what's your take on these performance numbers? There are some regressions (like |
I do not understand how there can be duplicates after merge? There is only one map with all the mono items. It was to handle the not duplicated function that I was calling the estimate function after each merge |
@andjo403 You are right, there cannot be duplicates after merging. What I was writing about is that the pre-merge logic does not account for duplicates. I'm thinking of something like the following example
The logic in the PR would produce:
while a smarter algorithm could produce:
I don't know if such examples are common in practice. It might be worth a try though. Note, that the algorithm that is in the compiler currently would still do worse than the one in the PR |
if we think that it is to much overhead to call about the regressions what I can see it is the LLVM_module_optimize_module_passes that stands for most of it. do not understand how this spilt affects the debug builds where no inlining is happening. |
have been thinking about a smarter merger that can take in to account the duplicated functions but have not been able to find a algorithm that will not be brute force and it will take to long time |
Yes, the naive approach is probably at least O(n²) where n=number of CGUs. Sounds like a fun challenge |
Those numbers are very messy and inconclusive. Unfortunately can't use instruction counts for this one, so we have to look at wall-times which have high variance. Based on just the numbers, I would say that the combination of improvements and regressions mean that it's not a clear win, and so I would be conservative and recommend against landing it. But I don't know anything about the theory of the patch, and whether that is compelling. It might be interesting to do another perf run and see how much the numbers change. That would at least give an idea of what is and is not noise. |
@michaelwoerister if we shall try another perf run did you have some changes that you wanted? thinks that it is a good idea to rebase also before the rerun as the base also can change the outcome. |
Thanks for taking a look, @nnethercote! We'll do another perf-run. @andjo403, yes, depending on how much you want to tinker:
|
consider the size of the inlined items for a more even partitioning
6396e58
to
a465d97
Compare
@michaelwoerister rebased and made an attempt to fix the comments |
Thanks, @andjo403! Let's do the perf run. @bors try @rust-timer queue |
Awaiting bors try build completion |
for a more even partitioning inline before merge consider the size of the inlined items for a more even partitioning for me this change take the compile time for script-servo-opt from 306s to 249s edit: the times is for a 32 thread cpu. cc #64913
☀️ Try build successful - checks-azure |
Queued f0ea016 with parent 7e49800, future comparison URL. |
Finished benchmarking try commit f0ea016, comparison URL. |
Thanks for implementing size estimation caching, @andjo403! The new code is actually nicer than before. From the second perf run it looks like I'm a bit conflicted about what to do with the PR. It seems like an intrinsically good idea but it does not seem to universally improve the situation. I'm still wondering if there is a (simple) way to make merging smarter with respect to inline function duplication. We wouldn't need perfect results. |
@michaelwoerister I can not find a "simple" solution to merge have been looking in to some kind of minHash but that will only tell what sets of functions that is similar so it can be used to sort but then the question still is how to select what to merge we do not want to merge two large CGUs only due to they are similar, I'm ok with closing the PR if this is not what we want even if it is sad for me that have many cores and can not use them |
@andjo403 Would you be up for doing a local run of our benchmark suite? Then we would have a clearer picture on how performance is at higher core counts. We could also maybe just make the behavior depend on the actual core count of the machine (although that might turn out to be tricky because the jobserver gives out tokens dynamically). |
shall I close this PR? |
Thanks for collecting that data, @andjo403! I concur that the results are too mixed for making this change. Let's close this PR but let's also make sure to record the findings in a followup issue. |
@michaelwoerister I think that I have changed my mind about that we maybe shall merge it any way as the perf numbers now build on luck more then some deliberate algorithm. also think that the mir-inliner PR #68213 have the same problem. |
@andjo403 It probably makes sense for someone to look into this. What we have at the moment is strictly a "better than nothing" solution. |
consider the size of the inlined items for a more even partitioning
for me this change take the compile time for script-servo-opt from 306s to 249s
edit: the times is for a 32 thread cpu.
cc #64913