Commit 7dee899

anijain2305 authored and pytorchmergebot committed
[dynamo][guards] Flush cache to more accurately measure guard overhead (pytorch#154764)
We observed that the guard overhead measured at runtime from profiler traces was higher than the overhead reported by this profiling function at compile time. After investigation, we found that f_locals were already in the CPU cache, which made the guard overhead appear much smaller when profiled during compilation. To make the measurement more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at runtime) allows faster iteration, as well as logging in tlparse and internal databases.

Pull Request resolved: pytorch#154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: pytorch#154769
1 parent 409c396 commit 7dee899
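
To make the cache effect described in the commit message concrete, below is a minimal standalone C++ sketch (not PyTorch code) of how a cache-warm micro-benchmark under-reports latency. It times the same workload twice: once while its data is still warm in the cache, and once after evicting the cache by streaming through a 32 MiB buffer, the same eviction size the patch uses. The `scan` workload and all sizes here are illustrative assumptions, not taken from the commit.

```cpp
// Standalone illustration; assumes a ~32 MiB last-level cache and 64-byte
// cache lines, mirroring the constants used in the patch. Not PyTorch code.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Evict cached data by streaming through a buffer larger than the LLC,
// touching one byte per cache line.
char flush_cache_by_eviction() {
  constexpr size_t evict_size = 32 * 1024 * 1024;
  std::vector<char> buffer(evict_size, 1);
  volatile char sink = 0;
  for (size_t i = 0; i < buffer.size(); i += 64) {
    sink ^= buffer[i];
  }
  return sink;
}

// Hypothetical stand-in for guard evaluation: read a modest working set.
long scan(const std::vector<long>& data) {
  long sum = 0;
  for (long v : data) {
    sum += v;
  }
  return sum;
}

int main() {
  std::vector<long> data(1 << 20, 1);  // ~8 MiB working set
  long keep = 0;

  // Warm measurement: the working set was just brought into cache.
  keep += scan(data);  // populate the cache
  auto t0 = std::chrono::high_resolution_clock::now();
  keep += scan(data);
  auto t1 = std::chrono::high_resolution_clock::now();

  // Flushed measurement: evict the working set first, as the patch does.
  volatile char dummy = flush_cache_by_eviction();
  (void)dummy;
  auto t2 = std::chrono::high_resolution_clock::now();
  keep += scan(data);
  auto t3 = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double> warm = t1 - t0;
  std::chrono::duration<double> flushed = t3 - t2;
  std::printf("warm: %.1f us, flushed: %.1f us (checksum %ld)\n",
              warm.count() * 1e6, flushed.count() * 1e6, keep);
  return 0;
}
```

On typical hardware the flushed run is markedly slower than the warm one; that gap is the compile-time versus runtime discrepancy that motivated this change.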

File tree: 2 files changed, +28 -9 lines changed


torch/_dynamo/guards.py

Lines changed: 7 additions & 5 deletions
@@ -2851,12 +2851,14 @@ def make_guard_filter_entry(guard):
             self.guard_manager, output_graph.local_scope
         )

-        # NB for developers: n_iters is chosen to be 50 to achieve
-        # statistical significance. If you are working on a guard
-        # optimization, it might be a good idea to increase this number for
-        # more stabiilty during development.
+        # NB for developers: n_iters is chosen to be 1 to prevent excessive
+        # increases in compile time. We first do a cache flush to measure the
+        # guard latency more accurately. This cache flush is expensive.
+        # Note: if you are working on a guard optimization, it might be a
+        # good idea to increase this number for more stability during
+        # development.
         latency = profile_guard_manager(
-            self.guard_manager.root, output_graph.local_scope, 50
+            self.guard_manager.root, output_graph.local_scope, 1
         )
         guards_log.debug("Guard eval latency = %s us", f"{latency:.2f}")
         # Note: We use `increment_toplevel` instead of `compilation_metric`

torch/csrc/dynamo/guards.cpp

Lines changed: 21 additions & 4 deletions
@@ -5276,23 +5276,40 @@ void install_storage_overlapping_guard(
       /* overlapping= */ false);
 }

+char flush_cache_by_eviction() {
+  constexpr size_t evict_size = 32 * 1024 * 1024;
+  std::vector<char> buffer(evict_size, 1);
+
+  volatile char sink = 0;
+  for (size_t i = 0; i < buffer.size(); i += 64) {
+    sink ^= buffer[i];
+  }
+  return sink;
+}
+
 double profile_guard_manager(
     RootGuardManager* root,
     py::object f_locals,
     int n_iters) {
   PyObject* locals = f_locals.ptr();

-  // Warmup
+  // Warmup to set up fast paths (like dict_tags) for the actual profiling
   for (int i = 0; i < 5; i++) {
     root->check_nopybind(locals);
   }

-  auto start = std::chrono::high_resolution_clock::now();
+  std::chrono::duration<double> total_elapsed{0.0};
   for (int i = 0; i < n_iters; i++) {
+    // Flush the caches to measure the overhead accurately; store the result
+    // into a volatile so the flush cannot be optimized away.
+    volatile char dummy = flush_cache_by_eviction();
+    (void)dummy;
+
+    auto start = std::chrono::high_resolution_clock::now();
     root->check_nopybind(locals);
+    auto end = std::chrono::high_resolution_clock::now();
+    total_elapsed += end - start;
   }
-  auto end = std::chrono::high_resolution_clock::now();
-  std::chrono::duration<double> total_elapsed = end - start;

   // Calculate the average time per iteration in microseconds
   return (total_elapsed.count() * 1e6) / n_iters;
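
For reference, the control flow of the patched `profile_guard_manager` can be lifted out of the PyTorch tree into a self-contained sketch. Here `guard_eval` is a hypothetical stand-in for `root->check_nopybind(locals)`, and `flush_cache_by_eviction` repeats the same helper as in the sketch above; the warmup, per-iteration flush, timing, and averaging mirror the structure of the diff. With the new default of `n_iters = 1` from guards.py you get a single flushed measurement; raising `n_iters` averages several flushed runs, as the updated comment recommends during guard-optimization work.

```cpp
// Self-contained sketch of the patched measurement loop. guard_eval is a
// hypothetical stand-in for root->check_nopybind(locals); not PyTorch code.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

char flush_cache_by_eviction() {
  constexpr size_t evict_size = 32 * 1024 * 1024;
  std::vector<char> buffer(evict_size, 1);
  volatile char sink = 0;
  for (size_t i = 0; i < buffer.size(); i += 64) {
    sink ^= buffer[i];
  }
  return sink;
}

// Hypothetical workload standing in for guard evaluation.
long guard_eval(const std::vector<long>& locals) {
  long acc = 0;
  for (long v : locals) {
    acc += v;
  }
  return acc;
}

// Mirrors the patched profile_guard_manager: warmup runs, then a cache flush
// before each timed iteration, then the average latency in microseconds.
double profile_with_flush(const std::vector<long>& locals, int n_iters) {
  long keep = 0;

  // Warmup, so one-time setup costs stay out of the measurement.
  for (int i = 0; i < 5; i++) {
    keep += guard_eval(locals);
  }

  std::chrono::duration<double> total_elapsed{0.0};
  for (int i = 0; i < n_iters; i++) {
    // Evict cached data so the timed run is not flattered by warm caches.
    volatile char dummy = flush_cache_by_eviction();
    (void)dummy;

    auto start = std::chrono::high_resolution_clock::now();
    keep += guard_eval(locals);
    auto end = std::chrono::high_resolution_clock::now();
    total_elapsed += end - start;
  }

  // Publish the accumulated result so the compiler cannot drop the calls.
  volatile long sink_out = keep;
  (void)sink_out;
  return (total_elapsed.count() * 1e6) / n_iters;
}

int main() {
  std::vector<long> locals(1 << 18, 1);  // ~2 MiB stand-in working set
  std::printf("n_iters=1:  %.2f us\n", profile_with_flush(locals, 1));
  std::printf("n_iters=50: %.2f us\n", profile_with_flush(locals, 50));
  return 0;
}
```

Note the trade-off the guards.py comment describes: every iteration pays for a full 32 MiB eviction, so a higher `n_iters` buys measurement stability at the cost of compile time.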
