Conversation

@r1viollet
Contributor

@r1viollet r1viollet commented Jul 7, 2025

What does this PR do?:

  • Reserve padded slots
  • Introduce register / unregister operations to retrieve slots
  • Manage a free list (a minimal sketch of the idea follows below)
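
A minimal sketch of the idea, with illustrative names and sizes; the real ThreadFilter grows storage in chunks (ChunkStorage) and uses lock-free, sharded free lists rather than the mutex used here:

#include <atomic>
#include <mutex>
#include <vector>

// Sketch only: each thread owns a cache-line-padded slot, registerThread()
// hands out a slot (reusing freed ones first), unregisterThread() returns it.
class ThreadFilterSketch {
public:
  using SlotID = int;

  SlotID registerThread(int tid) {
    SlotID id = -1;
    {
      std::lock_guard<std::mutex> lock(_free_lock);
      if (!_free.empty()) {            // reuse a previously released slot
        id = _free.back();
        _free.pop_back();
      }
    }
    if (id < 0) {                      // otherwise take a fresh one
      id = _next_index.fetch_add(1, std::memory_order_relaxed);
      if (id >= kMaxSlots) return -1;  // table full
    }
    _slots[id].value.store(tid, std::memory_order_release);
    return id;
  }

  void unregisterThread(SlotID id) {
    _slots[id].value.store(-1, std::memory_order_release);
    std::lock_guard<std::mutex> lock(_free_lock);
    _free.push_back(id);               // make the slot reusable
  }

  // Sampler side: a slot is "active" if it currently holds a thread id.
  bool accept(SlotID id) const {
    return _slots[id].value.load(std::memory_order_acquire) != -1;
  }

private:
  static constexpr int kMaxSlots = 1024;   // illustrative capacity

  // One slot per cache line so concurrent writers don't false-share.
  struct alignas(64) Slot {
    std::atomic<int> value{-1};
  };

  Slot _slots[kMaxSlots];
  std::atomic<int> _next_index{0};
  std::mutex _free_lock;
  std::vector<SlotID> _free;
};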

Motivation:

Improve throughput of applications that run on many threads with many context updates.

Additional Notes:

How to test the change?:

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

Unsure? Have a question? Request a review!

@github-actions

github-actions bot commented Jul 7, 2025

🔧 Report generated by pr-comment-cppcheck

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (306)

@github-actions

github-actions bot commented Jul 7, 2025

🔧 Report generated by pr-comment-scanbuild

@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch 3 times, most recently from e5bce28 to 0918008 on July 7, 2025 11:45
@r1viollet
Contributor Author

I have reasonable performance on most runs:

Benchmark                                                       (command)  (skipResults)  (workload)  Mode  Cnt    Score   Error  Units
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true           0  avgt         0.039          us/op
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true           7  avgt         0.041          us/op
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true       70000  avgt       111.094          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true           0  avgt         0.132          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true           7  avgt         0.139          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true       70000  avgt       108.666          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true           0  avgt         0.258          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true           7  avgt         0.278          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true       70000  avgt       118.940          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true           0  avgt         0.624          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true           7  avgt         0.646          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true       70000  avgt       160.170          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true           0  avgt         1.780          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true           7  avgt         2.288          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true       70000  avgt       221.987          us/op

I'm not sure why some runs still blow up for higher numbers of threads.

@r1viollet r1viollet mentioned this pull request Jul 7, 2025
@jbachorik jbachorik force-pushed the r1viollet/thread_filter_squash branch 2 times, most recently from e0ac246 to 2421ba9 on July 10, 2025 12:48
@r1viollet
Contributor Author

r1viollet commented Jul 10, 2025

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (305)

@jbachorik jbachorik force-pushed the r1viollet/thread_filter_squash branch from 2421ba9 to 50a8d5f on July 21, 2025 14:57
@jbachorik
Collaborator

I ran a comparison of native memory usage with different thread filter implementations; the data is in the notebook.

TL;DR: there is no observable increase in native memory usage (the UNDEFINED category). Still, it would be useful to have an extra counter for ThreadIDTable utilization.

jbachorik and others added 3 commits July 24, 2025 21:43
If the TLS cleanup fires before the JVMTI hook, we want to
ensure that we don't crash while retrieving the ProfiledThread
- Add a check on validity of ProfiledThread
- Start the profiler to ensure we have valid thread objects
- add asserts around missing thread object
- remove print (replacing with an assert)
@r1viollet
Contributor Author

r1viollet commented Aug 21, 2025

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (305)

@r1viollet r1viollet marked this pull request as ready for review August 21, 2025 07:53
@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch from 6171739 to e30d88f on August 21, 2025 14:05
- Fix removal of self in timerloop init
it was not using a slotID but a thread ID

- Add assertion to find other potential issues
@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch from e30d88f to e78a6b2 on August 21, 2025 14:08
Java_com_datadoghq_profiler_JavaProfiler_filterThreadRemove0(JNIEnv *env,
jobject unused) {
ProfiledThread *current = ProfiledThread::current();
if (unlikely(current == nullptr)) {
Contributor

@zhengyu123 zhengyu123 Aug 21, 2025

I think assert(current != nullptr) should be sufficient; otherwise, we have a bigger problem.

Contributor Author

I think this happens on unloading. The JVMTI cleanup can be removed before all threads are finished?

Contributor Author

We have the assert for debug builds, and we can keep avoiding crashes in release builds. Feel free to push back if you do not agree.

return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Contributor

Is it possible? Or should we just assert?

Contributor Author

Good question.
I think we are missing the instrumentation of some threads. Could non-Java threads call back into Java?
It would be nice to have asserts to debug this, though I'd prefer the safer path for a prod release.

return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Contributor

Same as above

private native void stop0() throws IllegalStateException;
private native String execute0(String command) throws IllegalArgumentException, IllegalStateException, IOException;

private native void filterThreadAdd0();
Contributor

It looks like filterThreadAdd0 == filterThread0(true) and filterThreadRemove0 == filterThread0(false). Please remove the duplication.
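
One possible shape of the deduplication on the native side; this is only a sketch that reuses the PR's ProfiledThread checks, the filterThreadCommon helper and the calls into Profiler/ThreadFilter are illustrative, and the Java-side declaration would change to a single filterThread0(boolean) accordingly:

// Shared native helper with the null/tid guards from the PR.
static void filterThreadCommon(bool add) {
  ProfiledThread *current = ProfiledThread::current();
  if (unlikely(current == nullptr)) {
    return;                                  // TLS already torn down
  }
  int tid = current->tid();
  if (unlikely(tid < 0)) {
    return;                                  // thread was never instrumented
  }
  ThreadFilter *filter = Profiler::instance()->threadFilter();
  if (add) {
    filter->add(tid);
  } else {
    filter->remove(tid);
  }
}

extern "C" JNIEXPORT void JNICALL
Java_com_datadoghq_profiler_JavaProfiler_filterThread0(JNIEnv *env,
                                                       jobject unused,
                                                       jboolean enable) {
  filterThreadCommon(enable == JNI_TRUE);
}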

void collect(std::vector<int> &v);
private:
// Optimized slot structure with padding to avoid false sharing
struct alignas(64) Slot {
Contributor

@zhengyu123 zhengyu123 Aug 21, 2025

We have a definition of DEFAULT_CACHE_LINE_SIZE in dd_arch.h. I would suggest the following code for readability and portability.

  struct alignas(DEFAULT_CACHE_LINE_SIZE) Slot {
      std::atomic<int> value{-1};
      char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(value)];
  };

std::atomic<SlotID> _next_index{0};
std::unique_ptr<FreeListNode[]> _free_list;

struct alignas(64) ShardHead { std::atomic<int> head{-1}; };
Contributor

Use DEFAULT_CACHE_LINE_SIZE for readability and portability.

Contributor Author

thanks
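
Applying the same suggestion to the shard head would look roughly like this (sketch; DEFAULT_CACHE_LINE_SIZE is the constant from dd_arch.h mentioned above):

// Per-shard free-list head, padded out to its own cache line.
struct alignas(DEFAULT_CACHE_LINE_SIZE) ShardHead {
  std::atomic<int> head{-1};
  char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(std::atomic<int>)];
};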

}
// Try to install it atomically
ChunkStorage* expected = nullptr;
if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_acq_rel)) {
Contributor

memory_order_release should be sufficient.

Contributor Author

agreed
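
For reference, a minimal sketch of the install-then-read pattern under discussion, with illustrative sizes and simplified types (not the PR's exact code): the winning CAS only needs release because it publishes a fully constructed chunk and never reads through the pointer it installs, while readers pair it with an acquire load.

#include <atomic>

struct Slot { std::atomic<int> value{-1}; };
struct ChunkStorage { Slot slots[256]; };            // illustrative chunk size
constexpr int kMaxChunks = 64;                       // illustrative
std::atomic<ChunkStorage*> _chunks[kMaxChunks];

ChunkStorage* getOrCreateChunk(int chunk_idx) {
  ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_acquire);
  if (chunk != nullptr) {
    return chunk;                                    // already published
  }
  ChunkStorage* new_chunk = new ChunkStorage();
  ChunkStorage* expected = nullptr;
  if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk,
                                                 std::memory_order_release)) {
    return new_chunk;                                // we published our chunk
  }
  delete new_chunk;                                  // lost the install race
  // Re-read with acquire so the winner's initialization is visible to us.
  return _chunks[chunk_idx].load(std::memory_order_acquire);
}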


ThreadFilter::SlotID ThreadFilter::registerThread() {
// If disabled, block new registrations
if (!_enabled.load(std::memory_order_acquire)) {
Contributor

I don't see what memory ordering _enabled provides. Could you explain what it releases?

Contributor Author

We want init to have been called (so that the constructor is finished).
This means other threads might not see the init as finished yet, though we can always register them lazily later (even if the on-thread-start path is better).
So this acts as a load barrier. I think it makes sense; feel free to challenge.
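
A sketch of the pairing described above (illustrative shape, not the PR's exact code): init() finishes building the filter and publishes that fact with a release store; registerThread() refuses to hand out slots until it observes the publication with an acquire load.

#include <atomic>

class ThreadFilterInitSketch {
public:
  void init() {
    // ... allocate chunk table, free list, shard heads ...
    _enabled.store(true, std::memory_order_release);   // publish initialized state
  }

  int registerThread() {
    if (!_enabled.load(std::memory_order_acquire)) {
      return -1;   // not initialized yet; the thread registers lazily later
    }
    // ... safe to touch everything published by init() ...
    return 0;
  }

private:
  std::atomic<bool> _enabled{false};
};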

return;
}

ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Contributor

@zhengyu123 zhengyu123 Aug 21, 2025

I think you need memory_order_acquire ordering here to match the release store.

Contributor Author

Same as the add path: I think we can keep relaxed. Feel free to challenge.

Contributor

I think you do need std::memory_order_acquire, just as you did in ThreadFilter::accept()

Contributor

@zhengyu123 zhengyu123 left a comment

I did a partial second-round review; I think there are many inconsistencies in the memory ordering.

_num_chunks.store(0, std::memory_order_release);
// Detach and delete chunks
for (int i = 0; i < kMaxChunks; ++i) {
ChunkStorage* chunk = _chunks[i].exchange(nullptr, std::memory_order_acq_rel);
Contributor

memory_order_acquire instead of memory_order_acq_rel

Contributor Author

adjusting

int slot_idx = slot_id & kChunkMask;

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Contributor

Need memory_order_acquire ordering

Contributor Author

This was intentional: the idea is that we either see null or a valid chunk. The impact of missing one trace is acceptable if the current thread does not see the chunk yet (which is pretty unlikely). WDYT?

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
if (likely(chunk != nullptr)) {
return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
Contributor

memory_order_relaxed should be sufficient.

Contributor Author

I think this is a tradeoff; I hope this is a path where acquire is acceptable. If you think otherwise, I can adjust.
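
A sketch of the fast path being debated, with illustrative constants: the chunk pointer is loaded relaxed (we either see a fully installed chunk or nullptr, and a missed chunk only costs one dropped sample, which is the author's argument above), while the slot value itself is loaded with acquire.

#include <atomic>

struct Slot { std::atomic<int> value{-1}; };
struct ChunkStorage { Slot slots[256]; };            // illustrative chunk size
constexpr int kMaxChunks = 64;                       // illustrative
constexpr int kChunkMask = 255;                      // slots per chunk - 1
std::atomic<ChunkStorage*> _chunks[kMaxChunks];

bool accept(int slot_id) {
  int chunk_idx = slot_id >> 8;                      // slot_id / slots-per-chunk
  int slot_idx  = slot_id & kChunkMask;
  ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
  if (chunk != nullptr) {
    return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
  }
  return false;                                      // chunk not visible yet: skip this sample
}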

@r1viollet
Contributor Author

Thanks @zhengyu123, I really appreciate the thorough review. Apologies, I only have a little time to spend on this each week.

…t/thread_filter_squash

In the current merge, I'm removing the active bitmap. The ActiveBitmap would need to be adapted to the slot logic; we can do that by retrieving the address of the slot, which can be simpler than going through the bitmap.
int slot_idx = slot_id & kChunkMask;

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Contributor

I think you do need std::memory_order_acquire here, just as in ThreadFilter::accept(), to match the release from _chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_release).

Contributor Author

I'm not sure we do, but let's measure before we debate. I'll run the test to see if this changes anything on perf.

return;
}

ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Contributor

I think you do need std::memory_order_acquire, just as you did in ThreadFilter::accept()

@r1viollet
Contributor Author

@zhengyu123 are we OK with this version?

// Collect thread IDs from the fixed-size table into the main set
_thread_ids[i][old_index].collect(threads);
_thread_ids[i][old_index].clear();
}
Contributor

Release here?

Contributor Author

I think memory_order_acq_rel is good here for old_index, though we might be talking about different things.
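
A rough, self-contained sketch of the double-buffered collection pattern the snippet above suggests; the shard count, table type, and index handling are guesses rather than the PR's code. Writers record into the buffer selected by _index; the collector flips _index with acq_rel and drains the retired side.

#include <atomic>
#include <mutex>
#include <unordered_set>
#include <vector>

constexpr int kShards = 4;                           // illustrative

struct IdTable {
  std::mutex lock;
  std::unordered_set<int> ids;
  void add(int tid) { std::lock_guard<std::mutex> g(lock); ids.insert(tid); }
  void collect(std::vector<int>& out) {
    std::lock_guard<std::mutex> g(lock);
    out.insert(out.end(), ids.begin(), ids.end());
  }
  void clear() { std::lock_guard<std::mutex> g(lock); ids.clear(); }
};

IdTable _thread_ids[kShards][2];
std::atomic<int> _index{0};

void record(int shard, int tid) {
  _thread_ids[shard][_index.load(std::memory_order_acquire)].add(tid);
}

void collect(std::vector<int>& threads) {
  int old_index = _index.fetch_xor(1, std::memory_order_acq_rel);  // flip buffers
  for (int i = 0; i < kShards; ++i) {
    _thread_ids[i][old_index].collect(threads);      // drain the retired buffer
    _thread_ids[i][old_index].clear();
  }
}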

ProfiledThread *current = ProfiledThread::current();
int tid = -1;

if (current != nullptr) {
Contributor

Can current == nullptr?

Contributor Author

I remember seeing a crash around this.
Basically, JVMTI could unload and delete the ProfiledThread from under our feet.

@r1viollet r1viollet merged commit 41fcf55 into main Sep 11, 2025
94 checks passed
@r1viollet r1viollet deleted the r1viollet/thread_filter_squash branch September 11, 2025 14:49
@github-actions github-actions bot added this to the 1.32.0 milestone Sep 11, 2025
@r1viollet
Contributor Author

Note for the future: if we care about the ~100 KB, we can reduce the padding. This becomes a tradeoff between memory footprint and the risk of false sharing.
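
For scale (illustrative arithmetic, assuming the footprint is dominated by the 64-byte padded slots): 100 KB / 64 B per slot is roughly 1,600 slots, so halving the padding to 32 bytes would save on the order of 50 KB at the cost of packing two slots per cache line.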
