Thread filter optim #238

Conversation
🔧 Report generated by pr-comment-scanbuild
Force-pushed e5bce28 to 0918008
|
I see reasonable performance on most runs; I'm not sure why some runs still blow up at higher thread counts.
Force-pushed e0ac246 to 2421ba9
- Reserve padded slots (see the sketch below)
- Introduce register/unregister to retrieve slots
- Manage a free list
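A minimal sketch of the scheme those notes describe, assuming illustrative names (ThreadFilterSketch, kMaxSlots, and the free-list layout are mine, not the PR's):

#include <atomic>

// Sketch only: one cache-line-padded slot per registered thread, plus a
// lock-free free list of slot indices so unregistered slots are reused.
// A production version must also handle the ABA problem on the pop path
// (e.g., with a tagged head); this sketch ignores it for brevity.
struct alignas(64) Slot {
  std::atomic<int> value{-1};  // -1 means "slot unused"
};

class ThreadFilterSketch {
  static constexpr int kMaxSlots = 4096;
  Slot _slots[kMaxSlots];
  std::atomic<int> _next_index{0};         // bump allocator for fresh slots
  std::atomic<int> _free_head{-1};         // head of the free list
  std::atomic<int> _free_next[kMaxSlots];  // per-slot "next free" link

public:
  int registerThread() {
    // Prefer recycling a slot from the free list.
    int head = _free_head.load(std::memory_order_acquire);
    while (head != -1 &&
           !_free_head.compare_exchange_weak(
               head, _free_next[head].load(std::memory_order_relaxed),
               std::memory_order_acq_rel)) {
    }
    if (head != -1) {
      return head;
    }
    // Otherwise hand out a fresh slot.
    int idx = _next_index.fetch_add(1, std::memory_order_relaxed);
    return idx < kMaxSlots ? idx : -1;
  }

  void unregisterThread(int slot) {
    _slots[slot].value.store(-1, std::memory_order_release);
    // Push the slot back onto the free list.
    int head = _free_head.load(std::memory_order_relaxed);
    do {
      _free_next[slot].store(head, std::memory_order_relaxed);
    } while (!_free_head.compare_exchange_weak(head, slot,
                                               std::memory_order_acq_rel));
  }
};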
Force-pushed 2421ba9 to 50a8d5f
I ran a comparison of native memory usage with the different thread filter implementations; the data is in the notebook. TL;DR: there is no observable increase in native memory usage (the
…t/thread_filter_squash
If the TLS cleanup fires before the JVMTI hook, we want to ensure that we don't crash while retrieving the ProfiledThread.
- Add a check on the validity of ProfiledThread
- Start the profiler to ensure we have valid thread objects
- Add asserts around missing thread objects
- Remove a print (replacing it with an assert)
…t/thread_filter_squash
Force-pushed 6171739 to e30d88f
- Fix removal of self in timer loop init: it was using a thread ID instead of a slot ID
- Add an assertion to find other potential issues
Force-pushed e30d88f to e78a6b2
Java_com_datadoghq_profiler_JavaProfiler_filterThreadRemove0(JNIEnv *env,
                                                             jobject unused) {
  ProfiledThread *current = ProfiledThread::current();
  if (unlikely(current == nullptr)) {
I think assert(current != nullptr) should be sufficient; otherwise, we have a bigger problem.
I think this happens on unloading. The JVMTI cleanup can be removed before all threads have finished?
We have the assert for debug builds, but we can keep avoiding crashes in release builds. Feel free to answer if you do not agree.
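For reference, the pattern we're converging on, sketched with a stand-in ProfiledThread (the real class lives in this repo; the function name here is hypothetical):

#include <cassert>

// Stub standing in for the project's ProfiledThread (illustration only).
struct ProfiledThread {
  static ProfiledThread* current();
  int tid() const;
};

void filterThreadRemoveSketch() {
  ProfiledThread* current = ProfiledThread::current();
  // Debug builds: catch the bug loudly.
  assert(current != nullptr && "TLS cleanup raced the JVMTI hook?");
  // Release builds (NDEBUG): the assert compiles out, so keep the guard
  // and fail soft instead of crashing during shutdown.
  if (current == nullptr) {
    return;
  }
  // ... proceed with current->tid() ...
}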
  return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Is this possible? Or should we just assert?
Good question. I think we are missing the instrumentation of some threads. Could non-Java threads call back into Java?
It could be nice to have asserts to debug this, though I'd prefer the safer path for a prod release.
  return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Same as above
private native void stop0() throws IllegalStateException;
private native String execute0(String command) throws IllegalArgumentException, IllegalStateException, IOException;

private native void filterThreadAdd0();
It looks like filterThreadAdd0() == filterThread0(true) and filterThreadRemove0() == filterThread0(false). Please remove the duplication.
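Something like this on the native side (a sketch, not the PR's actual diff; the shared filterThread0 entry point and the filter bodies are assumed):

#include <jni.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

// Stand-in for the repo's ProfiledThread (illustration only).
struct ProfiledThread {
  static ProfiledThread* current();
  int tid() const;
};

// Sketch: one JNI entry point taking a flag, instead of two
// near-identical add/remove functions.
extern "C" JNIEXPORT void JNICALL
Java_com_datadoghq_profiler_JavaProfiler_filterThread0(JNIEnv* env,
                                                       jobject unused,
                                                       jboolean enable) {
  ProfiledThread* current = ProfiledThread::current();
  if (unlikely(current == nullptr)) {
    return;
  }
  int tid = current->tid();
  if (unlikely(tid < 0)) {
    return;
  }
  if (enable) {
    // ... add tid to the thread filter ...
  } else {
    // ... remove tid from the thread filter ...
  }
}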
void collect(std::vector<int> &v);
private:
  // Optimized slot structure with padding to avoid false sharing
  struct alignas(64) Slot {
We have a definition of DEFAULT_CACHE_LINE_SIZE in dd_arch.h. I would suggest the following for readability and portability:

struct alignas(DEFAULT_CACHE_LINE_SIZE) Slot {
  std::atomic<int> value{-1};
  char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(value)];
};
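And, if useful, a compile-time guard on that layout (assuming DEFAULT_CACHE_LINE_SIZE from dd_arch.h is in scope):

// Keeps the padding arithmetic honest: one Slot must span exactly one
// cache line, or the false-sharing protection silently degrades.
static_assert(sizeof(Slot) == DEFAULT_CACHE_LINE_SIZE,
              "Slot must fill exactly one cache line");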
std::atomic<SlotID> _next_index{0};
std::unique_ptr<FreeListNode[]> _free_list;

struct alignas(64) ShardHead { std::atomic<int> head{-1}; };
Use DEFAULT_CACHE_LINE_SIZE for readability and portability.
thanks
}
// Try to install it atomically
ChunkStorage* expected = nullptr;
if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_acq_rel)) {
memory_order_release should be sufficient.
agreed
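For the record, the pairing being agreed on, sketched with assumed names (chunk_slot here stands for one _chunks[chunk_idx] cell):

#include <atomic>

struct ChunkStorage { /* slots ... */ };
std::atomic<ChunkStorage*> chunk_slot{nullptr};

void publish(ChunkStorage* new_chunk) {
  ChunkStorage* expected = nullptr;
  // release on success: everything written into *new_chunk before the CAS
  // becomes visible to any thread whose acquire load sees the pointer.
  if (!chunk_slot.compare_exchange_strong(expected, new_chunk,
                                          std::memory_order_release,
                                          std::memory_order_relaxed)) {
    delete new_chunk;  // another thread won the install race
  }
}

ChunkStorage* reader() {
  // acquire: pairs with the release above before the chunk is dereferenced.
  return chunk_slot.load(std::memory_order_acquire);
}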
ThreadFilter::SlotID ThreadFilter::registerThread() {
  // If disabled, block new registrations
  if (!_enabled.load(std::memory_order_acquire)) {
I don't see what memory ordering _enabled provides. Could you explain what it releases?
We want init() to have been called (so that the constructor is finished). This means other threads might not see init finished, though we can always lazily register them later (even if the on-thread-start path is better). So this acts as a load barrier. I think it makes sense. Feel free to challenge.
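As I understand the argument, the pairing looks like this (a sketch; init() and the return values are assumptions):

#include <atomic>

std::atomic<bool> _enabled_sketch{false};

void initSketch() {
  // ... construct chunks, free list, etc. ...
  // release: publishes the fully constructed state together with the flag.
  _enabled_sketch.store(true, std::memory_order_release);
}

int registerThreadSketch() {
  // acquire: a thread that observes true also observes the constructed
  // state; a thread that observes false simply registers lazily later.
  if (!_enabled_sketch.load(std::memory_order_acquire)) {
    return -1;
  }
  // ... safe to touch state initialized before the release store ...
  return 0;
}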
  return;
}

ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
I think you need memory_order_acquire ordering here to match the release store.
Same as the add; I think we can keep relaxed. Feel free to challenge.
I think you do need std::memory_order_acquire, just as you did in ThreadFilter::accept()
I did a partial second-round review; I think there are many inconsistencies in the memory ordering.
_num_chunks.store(0, std::memory_order_release);
// Detach and delete chunks
for (int i = 0; i < kMaxChunks; ++i) {
  ChunkStorage* chunk = _chunks[i].exchange(nullptr, std::memory_order_acq_rel);
memory_order_acquire instead of memory_order_acq_rel
adjusting
int slot_idx = slot_id & kChunkMask;

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Need memory_order_acquire ordering
This was intentional: the idea is that we have either null or a valid chunk. Missing one trace is acceptable if we don't see the chunk in the current thread (which is pretty unlikely). wdyt?
// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
if (likely(chunk != nullptr)) {
  return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
memory_order_relaxed should be sufficient.
I think this is a tradeoff; I hope this is a path where acquire is acceptable. If you think otherwise, I can adjust.
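To make the tradeoff concrete, here is the shape of the fast path as I read it (constants and layout are assumptions, and the ordering comments reflect the open debate rather than a settled answer):

#include <atomic>

// Assumed constants/types for illustration.
constexpr int kChunkShift = 10;
constexpr int kChunkMask = (1 << kChunkShift) - 1;
constexpr int kMaxChunks = 64;
struct SlotSketch { std::atomic<int> value{-1}; };
struct ChunkStorageSketch { SlotSketch slots[1 << kChunkShift]; };
std::atomic<ChunkStorageSketch*> _chunks_sketch[kMaxChunks];

bool acceptSketch(int slot_id) {
  int chunk_idx = slot_id >> kChunkShift;
  int slot_idx = slot_id & kChunkMask;
  // The debated load: relaxed is cheaper, and a stale nullptr only drops
  // one sample, but only acquire formally guarantees that a non-null
  // pointer is seen together with the chunk's initialized contents
  // (pairing with the release that installed it).
  ChunkStorageSketch* chunk =
      _chunks_sketch[chunk_idx].load(std::memory_order_acquire);
  if (chunk == nullptr) {
    return false;  // at worst one trace is missed
  }
  return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
}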
Thanks @zhengyu123, I really appreciate the thorough review. Apologies that I only have a little time to spend on this every week.
…t/thread_filter_squash
In the current merge, I'm removing the active bitmap. The ActiveBitmap needs to be adjusted to the slot logic; we can adjust it by retrieving the address of the slot. This can be simpler than with the bitmap.
int slot_idx = slot_id & kChunkMask;

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
I think you do need std::memory_order_acquire, just as you did in ThreadFilter::accept(), to match the release from:

ChunkStorage* expected = nullptr;
if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_release))
I'm not sure we do, but let's measure before we debate. I'll run the test to see if this changes anything on perf.
  return;
}

ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
I think you do need std::memory_order_acquire, just as you did in ThreadFilter::accept()
@zhengyu123 are we OK with this version?
// Collect thread IDs from the fixed-size table into the main set
_thread_ids[i][old_index].collect(threads);
_thread_ids[i][old_index].clear();
}
Release here?
I think memory_order_acq_rel is good here for the old_index, though we might be talking about something else.
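If I follow the structure, the collect path flips a double-buffer index and drains the old side. A heavily hedged sketch (the real _thread_ids layout differs, and Table here is a non-thread-safe stand-in):

#include <atomic>
#include <vector>

// Non-thread-safe stand-in for the per-shard fixed-size table.
struct Table {
  std::vector<int> ids;
  void collect(std::vector<int>& out) { out.insert(out.end(), ids.begin(), ids.end()); }
  void clear() { ids.clear(); }
};

std::atomic<int> _active{0};
Table _buffers[2];

// Assumes a single collector thread performs the flip.
void collectSketch(std::vector<int>& threads) {
  // acq_rel on the flip: acquire the writers' entries published into the
  // old buffer, release the collector's own prior writes (the clear()).
  int old_index = _active.fetch_xor(1, std::memory_order_acq_rel);
  _buffers[old_index].collect(threads);
  _buffers[old_index].clear();
}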
ProfiledThread *current = ProfiledThread::current();
int tid = -1;

if (current != nullptr) {
Can current == nullptr?
I remember seeing a crash around this: basically, JVMTI could unload and delete the profiled thread from under our feet.
Notes for the future: if we care about the ~100KB, we can reduce the padding. This becomes a tradeoff between memory and the risk of false sharing.
What does this PR do?:

Motivation:
Improve throughput of applications that run on many threads with many context updates.

Additional Notes:

How to test the change?:

For Datadog employees:
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.

Unsure? Have a question? Request a review!