Add new self-profiling event to cheaply aggregate query cache hit counts #142978

Merged

3 commits merged into rust-lang:master from the query-hit branch on Jul 2, 2025

Conversation

Kobzol
Member

@Kobzol Kobzol commented Jun 24, 2025

Self-profiling can record various kinds of events, some of which are not enabled by default, like query cache hits. Rustc currently records cache hits as "instant" measureme events, which record the thread ID and current timestamp and construct an individual event for each cache hit. This is incredibly expensive: in a small hello-world benchmark that just depends on serde, it makes compilation with nightly go from ~3s (with `-Zself-profile`) to ~15s (with `-Zself-profile -Zself-profile-events=default,query-cache-hit`).

We'd like to add query cache hits to rustc-perf (rust-lang/rustc-perf#2168), but there we only need the actual cache hit counts, not the timestamp/thread ID metadata associated with them.

This PR adds a new `query-cache-hit-count` event. Instead of generating individual instant events, it simply aggregates cache hit counts per query invocation (so a combination of a query and its arguments, if I understand it correctly) using an atomic counter. At the end of the compilation session, these counts are then dumped to the self-profile log using integer events (in a similar fashion to how we record artifact sizes). I suppose that we could dedup the query invocations in rustc directly, but I don't think it's really required. In local experiments with the hello world + serde case, the query invocation records generated ~30 KiB more data in the self-profile, which was a ~10% increase in this case.

With this PR, the overhead of `-Zself-profile` seems to be the same as before, at least on my machine, so I also enabled query cache hit counts by default when self-profiling is enabled.

We should also modify `analyzeme`, specifically https://github.com/rust-lang/measureme/blob/master/analyzeme/src/analysis.rs#L139, and make it load the integer events with query cache hit counts. I can do that as a follow-up; it's not required to be done in sync with this PR, and it doesn't require changes in rustc.
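For illustration, the aggregation scheme described above can be sketched as a small stand-alone type (a hypothetical sketch, not the actual rustc code; the names and the plain `u32` invocation ID are invented here):

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical aggregator: one atomic counter per query invocation.
/// Incrementing an existing counter only takes the cheap read lock;
/// the write lock is needed only the first time an invocation is seen.
struct CacheHitCounts {
    hits: RwLock<HashMap<u32, AtomicU64>>,
}

impl CacheHitCounts {
    fn new() -> Self {
        Self { hits: RwLock::new(HashMap::new()) }
    }

    fn increment(&self, invocation_id: u32) {
        {
            let map = self.hits.read().unwrap();
            if let Some(counter) = map.get(&invocation_id) {
                counter.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
        // First hit for this invocation: take the write lock and insert.
        let mut map = self.hits.write().unwrap();
        map.entry(invocation_id)
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(1, Ordering::Relaxed);
    }

    /// At the end of the session, collect the counts so they can be
    /// written out as integer events (here we just return them sorted).
    fn finish(&self) -> Vec<(u32, u64)> {
        let map = self.hits.read().unwrap();
        let mut counts: Vec<_> =
            map.iter().map(|(id, c)| (*id, c.load(Ordering::Relaxed))).collect();
        counts.sort_unstable();
        counts
    }
}
```

At session end, each `(invocation, count)` pair from `finish` would then be emitted as an integer event, analogous to how artifact sizes are recorded.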

CC @cjgillot

r? @oli-obk

@rustbot rustbot added A-query-system Area: The rustc query system (https://rustc-dev-guide.rust-lang.org/query.html) S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 24, 2025
@Kobzol
Member Author

Kobzol commented Jun 24, 2025

Judging by https://github.com/search?type=code&q=query-cache-hits, it looks like no one used this anyway… 😆

/// With this approach, we don't know the individual thread IDs and timestamps
/// of cache hits, but it has very little overhead on top of `-Zself-profile`.
/// Recording the cache hits as individual events made compilation 3-5x slower.
query_hits: RwLock<FxHashMap<QueryInvocationId, AtomicU64>>,
Member

Could you switch this to using a dense map, e.g. IndexVec? QueryInvocationId should be monotonically assigned I think and so this should end up dense.

Member Author

Are they allocated monotonically in the order of executed queries though? 🤔 We don't know before the start of rustc how many invocations there will be (I assume, since it includes queries combined with the unique argument combinations), so we can't preallocate it. So the only thing we could do is .push() on demand (if the new ID is one larger than the size of the vec), and lookup by index. Is that what you meant?

Member Author

It doesn't look like they are strictly monotonic:

ID: 2
ID: 2
ID: 4
ID: 1
ID: 0
ID: 4
ID: 7
ID: 8
ID: 1
ID: 9
ID: 10
ID: 11
ID: 12
ID: 13
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1

I guess that it depends on the invocations being cached or not, loaded from disk, etc. I don't think we can count on them actually arriving in order.

That being said, instead of push, I suppose that we could do something like query_hits.resize(new_observed_max_id, 0). Do you want me to do that?
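A minimal sketch of that resize-based dense variant (hypothetical code; `QueryInvocationId` is simplified to a plain `u32` index):

```rust
use std::sync::RwLock;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical dense variant: counters are indexed directly by
/// invocation ID, growing the vector on demand, since the IDs do
/// not necessarily arrive in monotonic order.
struct DenseCacheHits {
    hits: RwLock<Vec<AtomicU64>>,
}

impl DenseCacheHits {
    fn new() -> Self {
        Self { hits: RwLock::new(Vec::new()) }
    }

    fn increment(&self, id: u32) {
        let index = id as usize;
        {
            let v = self.hits.read().unwrap();
            if let Some(counter) = v.get(index) {
                counter.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
        // Slow path: grow the vector to cover the newly observed maximum ID.
        let mut v = self.hits.write().unwrap();
        if v.len() <= index {
            v.resize_with(index + 1, || AtomicU64::new(0));
        }
        v[index].fetch_add(1, Ordering::Relaxed);
    }

    fn count(&self, id: u32) -> u64 {
        self.hits
            .read()
            .unwrap()
            .get(id as usize)
            .map_or(0, |c| c.load(Ordering::Relaxed))
    }
}
```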

Member Author

Pushed a change to use a vec, let me know what you think. (I didn't use IndexVec because we index it with QueryInvocationId just once; the rest of the operations (like resizing or iterating) work with usize anyway. Also, we would need to implement Idx for it, which requires working with usize, but the invocation ID only stores a u32.)

I wonder if we can hit some pathological cases here if we keep `resize_with`-ing the vec by one element each time...

|profiler| profiler.query_cache_hit_event_kind,
query_invocation_id,
);
profiler_ref.profiler.as_ref().unwrap().increment_query_cache_hit(query_invocation_id);
}

if unlikely(self.event_filter_mask.contains(EventFilter::QUERY_CACHE_HITS)) {
Member

Any reason to change the existing event rather than adding a new one that only tracks counts?

I think this is losing the information for the query "tree" that was previously present, right? It used to be possible to generate a flamegraph of queries but now since there's no timing/thread information we can't track the parent relationships.

That doesn't seem consistently useful, but it also doesn't seem useless to me...

Member Author

Well, I figured that it wasn't really used in practice (I haven't found anything on GitHub code search), and it was quite expensive. A practical reason to avoid adding a new filter event was to avoid having two mask checks in this very hot function. But the cost of that (with `-Zself-profile` enabled) is probably still minuscule compared to what was happening before, and without self-profiling we could just check whether `QUERY_CACHE_HITS | QUERY_CACHE_HITS_COUNT` is enabled, to keep a single check in the fast path, so it would probably be fine.

Happy to add a new filter event though, should be simple enough, and wouldn't break backwards compatibility.
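The "single check in the fast path" idea amounts to testing both bits in one branch; a tiny illustration with invented flag values (rustc's actual `EventFilter` constants may differ):

```rust
// Hypothetical bit values for the two event filters.
const QUERY_CACHE_HITS: u32 = 1 << 0;
const QUERY_CACHE_HIT_COUNTS: u32 = 1 << 1;

/// One branch on the hot path: is either cache-hit event enabled?
/// Only if this passes do we need to distinguish which of the two
/// events is actually requested.
fn any_cache_hit_event(mask: u32) -> bool {
    mask & (QUERY_CACHE_HITS | QUERY_CACHE_HIT_COUNTS) != 0
}
```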

How do you generate such a flamegraph that takes query hits into account, btw?

Contributor

Ooh, I need to try this out; it could be very useful to see the timestamp difference between query hit counts within another query to analyze performance changes.

@Kobzol Kobzol changed the title Make recording of query cache hits in self-profiler much cheaper Add new self-profiling event to cheaply aggregate query cache hit counts Jun 25, 2025
@Kobzol
Member Author

Kobzol commented Jun 25, 2025

I changed the implementation to use a vec instead of a map (although I think map performance is also roughly fine), and added a new event filter (enabled by default) instead of changing the behavior of the old one.

@oli-obk
Contributor

oli-obk commented Jul 1, 2025

@bors r+

@bors
Collaborator

bors commented Jul 1, 2025

📌 Commit e8fc30e has been approved by oli-obk

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 1, 2025
@Kobzol
Member Author

Kobzol commented Jul 1, 2025

@bors rollup=never

Just in case there's some perf. effect.

bors added a commit that referenced this pull request Jul 2, 2025
Add new self-profiling event to cheaply aggregate query cache hit counts

@bors
Collaborator

bors commented Jul 2, 2025

⌛ Testing commit e8fc30e with merge ece9f3f...

@rust-log-analyzer

This comment has been minimized.

@bors
Collaborator

bors commented Jul 2, 2025

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 2, 2025
@Kobzol
Member Author

Kobzol commented Jul 2, 2025

Oops, apparently 32-bit platforms are a thing. I decided to use portable AtomicU64 instead of AtomicUsize. But maybe it's unreasonable to expect more than 4 billion query cache hits in a single compilation session?
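For illustration, the portability problem is that `AtomicU64` is not natively available on every 32-bit target; a cfg-based fallback sketch (hand-rolled here, not the portable `AtomicU64` type actually used in the PR) could look like:

```rust
/// Hypothetical counter that compiles even on targets lacking native
/// 64-bit atomics: use AtomicU64 where available, a Mutex<u64> otherwise.
struct PortableCounter {
    #[cfg(target_has_atomic = "64")]
    inner: std::sync::atomic::AtomicU64,
    #[cfg(not(target_has_atomic = "64"))]
    inner: std::sync::Mutex<u64>,
}

impl PortableCounter {
    fn new() -> Self {
        #[cfg(target_has_atomic = "64")]
        return Self { inner: std::sync::atomic::AtomicU64::new(0) };
        #[cfg(not(target_has_atomic = "64"))]
        return Self { inner: std::sync::Mutex::new(0) };
    }

    fn add(&self, n: u64) {
        // Lock-free on targets with 64-bit atomics, mutex-guarded elsewhere.
        #[cfg(target_has_atomic = "64")]
        self.inner.fetch_add(n, std::sync::atomic::Ordering::Relaxed);
        #[cfg(not(target_has_atomic = "64"))]
        { *self.inner.lock().unwrap() += n; }
    }

    fn get(&self) -> u64 {
        #[cfg(target_has_atomic = "64")]
        return self.inner.load(std::sync::atomic::Ordering::Relaxed);
        #[cfg(not(target_has_atomic = "64"))]
        return *self.inner.lock().unwrap();
    }
}
```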

@bors2 try jobs=dist-powerpc-linux

@rust-bors

rust-bors bot commented Jul 2, 2025

⌛ Trying commit b49ca02 with merge fc33cd0

To cancel the try build, run the command @bors2 try cancel.

rust-bors bot added a commit that referenced this pull request Jul 2, 2025
Add new self-profiling event to cheaply aggregate query cache hit counts

try-job: dist-powerpc-linux
@rust-bors

rust-bors bot commented Jul 2, 2025

☀️ Try build successful (CI)
Build commit: fc33cd0 (fc33cd09cf4f19b8314985e513201c5758a0d506, parent: f51c9870bab634afb9e7a262b6ca7816bb9e940d)

@Kobzol
Member Author

Kobzol commented Jul 2, 2025

Looks good.

@bors r=oli-obk

@bors
Collaborator

bors commented Jul 2, 2025

📌 Commit b49ca02 has been approved by oli-obk

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 2, 2025
@bors
Collaborator

bors commented Jul 2, 2025

⌛ Testing commit b49ca02 with merge b94bd12...

@bors
Collaborator

bors commented Jul 2, 2025

☀️ Test successful - checks-actions
Approved by: oli-obk
Pushing b94bd12 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jul 2, 2025
@bors bors merged commit b94bd12 into rust-lang:master Jul 2, 2025
11 checks passed
@rustbot rustbot added this to the 1.90.0 milestone Jul 2, 2025
@Kobzol Kobzol deleted the query-hit branch July 2, 2025 14:43
Contributor

github-actions bot commented Jul 2, 2025

What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing f51c987 (parent) -> b94bd12 (this PR)

Test differences

Show 3 test diffs

3 doctest diffs were found. These are ignored, as they are noisy.

Test dashboard

Run

cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard b94bd12401d26ccf1c3b04ceb4e950b0ff7c8d29 --output-dir test-dashboard

And then open test-dashboard/index.html in your browser to see an overview of all executed tests.

Job duration changes

  1. x86_64-apple-1: 9434.0s -> 7843.5s (-16.9%)
  2. dist-ohos-x86_64: 4597.3s -> 4123.1s (-10.3%)
  3. mingw-check-1: 1787.7s -> 1954.1s (9.3%)
  4. mingw-check-tidy: 78.5s -> 72.2s (-8.1%)
  5. dist-arm-linux-gnueabi: 4544.3s -> 4881.7s (7.4%)
  6. dist-armv7-linux: 5284.2s -> 4892.3s (-7.4%)
  7. mingw-check-2: 2093.7s -> 1950.1s (-6.9%)
  8. dist-various-1: 3764.0s -> 3991.5s (6.0%)
  9. dist-powerpc64-linux: 5022.6s -> 5318.7s (5.9%)
  10. dist-loongarch64-linux: 6292.1s -> 5948.5s (-5.5%)
How to interpret the job duration changes?

Job durations can vary a lot, based on the actual runner instance
that executed the job, system noise, invalidated caches, etc. The table above is provided
mostly for t-infra members, for simpler debugging of potential CI slow-downs.

@rust-timer
Collaborator

Finished benchmarking commit (b94bd12): comparison URL.

Overall result: ❌ regressions - please read the text below

Our benchmarks found a performance regression caused by this PR.
This might be an actual regression, but it can also be just noise.

Next Steps:

  • If the regression was expected or you think it can be justified,
    please write a comment with sufficient written justification, and add
    @rustbot label: +perf-regression-triaged to it, to mark the regression as triaged.
  • If you think that you know of a way to resolve the regression, try to create
    a new PR with a fix for the regression.
  • If you do not understand the regression or you think that it is just noise,
    you can ask the @rust-lang/wg-compiler-performance working group for help (members of this group
    were already notified of this PR).

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

                              mean     range            count
Regressions ❌ (primary)       0.3%     [0.1%, 0.4%]     17
Regressions ❌ (secondary)     0.4%     [0.3%, 0.5%]      7
Improvements ✅ (primary)      -        -                 0
Improvements ✅ (secondary)    -        -                 0
All ❌✅ (primary)              0.3%     [0.1%, 0.4%]     17

Max RSS (memory usage)

Results (primary -2.5%, secondary -3.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

                              mean     range            count
Regressions ❌ (primary)       -        -                 0
Regressions ❌ (secondary)     -        -                 0
Improvements ✅ (primary)      -2.5%    [-2.7%, -2.4%]    2
Improvements ✅ (secondary)    -3.1%    [-3.1%, -3.1%]    1
All ❌✅ (primary)              -2.5%    [-2.7%, -2.4%]    2

Cycles

Results (primary 1.6%, secondary 1.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

                              mean     range            count
Regressions ❌ (primary)       2.2%     [1.4%, 4.7%]      6
Regressions ❌ (secondary)     3.4%     [2.1%, 4.4%]      6
Improvements ✅ (primary)      -1.9%    [-1.9%, -1.9%]    1
Improvements ✅ (secondary)    -1.8%    [-2.6%, -0.7%]    4
All ❌✅ (primary)              1.6%     [-1.9%, 4.7%]     7

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 462.08s -> 462.452s (0.08%)
Artifact size: 372.26 MiB -> 372.20 MiB (-0.02%)

@rustbot rustbot added the perf-regression Performance regression. label Jul 3, 2025
@Kobzol
Member Author

Kobzol commented Jul 3, 2025

Might be genuine regressions, although the fast path shouldn't have changed much.
