Skip to content

Conversation

@timfish
Copy link
Collaborator

@timfish timfish commented Aug 28, 2024

Closes #13415

This PR only modifies the async version of this integration which is used for Node > v19. I tried applying similar changes to the sync integration and I cannot get it to work without causing memory leaks.

@Bruno-DaSilva has been helping me explore different ways to fix a few fundamental issues with the local variables integration. Bruno found a way to write to the error object from the debugger which removes any ambiguity over which variables go with which exception.

This allows us to remove the stack parsing and hashing which we were using previously to match up exceptions.

Rather than write the objectId to the error, I have used this to write the entire local variables array directly to the error object. This completely negates the need to post the local variables from the worker thread which removes any possibility of race conditions. We then later pull the local variables directly from hint.originalException.__SENTRY_ERROR_LOCAL_VARIABLES__.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 28, 2024

size-limit report 📦

⚠️ Warning: Base artifact is not the latest one, because the latest workflow run is not done yet. This may lead to incorrect results. Try to re-run all tests to get up to date results.

Path Size % Change Change
@sentry/browser 22.52 KB +0.02% +4 B 🔺
@sentry/browser (incl. Tracing) 34.78 KB +0.02% +4 B 🔺
@sentry/browser (incl. Tracing, Replay) 71.22 KB +0.01% +5 B 🔺
@sentry/browser (incl. Tracing, Replay) - with treeshaking flags 61.66 KB -4.39% -2.83 KB 🔽
@sentry/browser (incl. Tracing, Replay with Canvas) 75.57 KB +0.01% +4 B 🔺
@sentry/browser (incl. Tracing, Replay, Feedback) 88.29 KB +0.01% +5 B 🔺
@sentry/browser (incl. Tracing, Replay, Feedback, metrics) 90.13 KB +0.01% +6 B 🔺
@sentry/browser (incl. metrics) 26.83 KB +0.02% +4 B 🔺
@sentry/browser (incl. Feedback) 39.6 KB +0.02% +5 B 🔺
@sentry/browser (incl. sendFeedback) 27.19 KB +0.02% +5 B 🔺
@sentry/browser (incl. FeedbackAsync) 31.9 KB +0.02% +4 B 🔺
@sentry/react 25.28 KB +0.02% +5 B 🔺
@sentry/react (incl. Tracing) 37.74 KB +0.02% +6 B 🔺
@sentry/vue 26.67 KB +0.02% +5 B 🔺
@sentry/vue (incl. Tracing) 36.61 KB +0.02% +5 B 🔺
@sentry/svelte 22.65 KB +0.03% +5 B 🔺
CDN Bundle 23.77 KB +0.03% +7 B 🔺
CDN Bundle (incl. Tracing) 36.49 KB +0.02% +6 B 🔺
CDN Bundle (incl. Tracing, Replay) 70.9 KB +0.01% +7 B 🔺
CDN Bundle (incl. Tracing, Replay, Feedback) 76.21 KB +0.01% +7 B 🔺
CDN Bundle - uncompressed 69.63 KB +0.02% +12 B 🔺
CDN Bundle (incl. Tracing) - uncompressed 108.21 KB +0.02% +12 B 🔺
CDN Bundle (incl. Tracing, Replay) - uncompressed 219.86 KB +0.01% +12 B 🔺
CDN Bundle (incl. Tracing, Replay, Feedback) - uncompressed 233.05 KB +0.01% +12 B 🔺
@sentry/nextjs (client) 37.51 KB +0.01% +3 B 🔺
@sentry/sveltekit (client) 35.35 KB +0.02% +6 B 🔺
@sentry/node 114.63 KB -1.54% -1.79 KB 🔽
@sentry/node - without tracing 88.24 KB -1.98% -1.78 KB 🔽
@sentry/aws-serverless 97.64 KB -1.82% -1.8 KB 🔽
@sentry/browser - with treeshaking flags 21.3 KB added added

View base workflow run

@timfish
Copy link
Collaborator Author

timfish commented Aug 28, 2024

Ok so with the async debugger via worker, this results in a memory leak. I guess this needs some cleanup.

@Bruno-DaSilva
Copy link

Do we have to worry about object size with a large number of errors? Or a bunch of long lived error objects? Just trying to think if there's any edge cases with storing the local vars on the error objects themselves.

If we forsee any problem with that, then similar to the original approach it might be worth still using an extra datastructure to actually store the local variable state -- and force a maximum size (like with the LRU) that's configurable. And then have some sort of pointer to it on the error object instead (like a unique id).


re: memory leaks, I wonder if there's references to the error objects lingering around somewhere. I doubt the inspector api would write objects that are unable to be garbage collected, but I suppose it's possible if for some reason the inspector is holding references to the objects it made? 🤷

@timfish
Copy link
Collaborator Author

timfish commented Aug 28, 2024

Do we have to worry about object size with a large number of errors?

It might be worth adding a limit regardless of how we're passing the variables.

re: memory leaks, I wonder if there's references

I've managed to fix the memory leak for the async version (Node > v19) by calling Runtime.releaseObject after the debugger has resumed. Still looking into the sync one which does leak, just less than the async one did.

@Bruno-DaSilva
Copy link

Bruno-DaSilva commented Aug 28, 2024

I've managed to fix the memory leak for the async version (Node > v19) by calling Runtime.releaseObject after the debugger has resumed.

Niiiiiiiice, this makes sense!


parentPort?.postMessage({ exceptionHash, frames });
await session.post('Runtime.callFunctionOn', {
functionDeclaration: `function() { this.${LOCAL_VARIABLES_KEY} = ${JSON.stringify(frames)}; }`,

Check warning

Code scanning / CodeQL

Improper code sanitization

Code construction depends on an [improperly sanitized value](1).
@timfish
Copy link
Collaborator Author

timfish commented Aug 28, 2024

Oh it stopped the memory leak but also stopped it from working.

Back to the drawing board 😂

@timfish

This comment was marked as outdated.

Comment on lines 120 to 124
await session.post('Runtime.callFunctionOn', {
functionDeclaration: `function() { this.${LOCAL_VARIABLES_KEY} = ${JSON.stringify(frames)}; }`,
silent: true,
objectId: errorObjectId,
});

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one question - now that we are doing more inspector work while the main application is paused, do we have an idea of how much time awaiting on this Runtime.callFunctionOn takes?

I'm concerned that we might block the main thread (and by extension, the whole event loop) for long enough to be noticeable. I don't know how inspector performance scales with application size either, so it's possible it'd be more impactful on large codebases?

At minimum, is there an approach you recommend that would let us track how long we spend 'paused' before we resume execution? In my normal apps I'd maintain a prometheus metric here or something.
And further, if there's an approach that might let us do the session.post asynchronously and then have the sentry code wait for the data? Increases complexity but might fix this problem. Not worth doing before we know the performance impact of awaiting here, unless there's a good approach you can think of.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory test code catches an error and captures it's local variables once every millisecond. console.time suggests that the debugger is paused for 0.3-0.4ms on my machine to do this.

Whether this has an impact on your app will likely depend on a lot of variables. We have rate limiting for caught exceptions and maxExceptionsPerSecond defaults to 50. Rate limiting initially disables capture for 5 seconds and every subsequent rate limit trigger doubles this wait time. We have no rate limiting for uncaught exceptions.

Copy link

@Bruno-DaSilva Bruno-DaSilva Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. If we're pausing on the order of a low single digit milliseconds I'm sure for 99% of use cases this is fine. at 50 exceptions per second and 1ms per exception that blocks the event loop for only 50ms per second.
At 10ms, that's 500ms blocked.
At 20ms that's an entire second blocked. Which would even prevent us from hitting the 50 exceptions per second 😅

So yeah as long as we keep execution time low then it's more than fine.
Is it maybe worth adding a WARN log line if any single error execution takes longer than eg. 5ms? That way if people are debugging their slow applications they would be able to see it's the fault of the debugger?

In general, I just want to find a way to avoid the case where turning this feature on introduces a huge performance problem in some edge case and application developers have no idea why. That can lead to a lot of frustration when they eventually realize it's from an observability SDK. If we can expose runtime information somehow, or if we're confident that won't ever be a problem, that solves my worry :)

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it might be a good idea to change our rate-limiting to be based around total blocked time rather than number of paused events

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the happy middleground is some kind of dynamic backpressure detector, so we can adjust rate-limiting according to the volume of exceptions accordingly.

In our server-side SDKs (not node yet though), we do basic heuristic analysis to downsample outgoing events based on relative volume. Spec here: https://develop.sentry.dev/sdk/telemetry/traces/backpressure/. I assume we can take the larger general concept and apply it here.

Let's give this a try afterwards, we can scope it out further.

@Bruno-DaSilva
Copy link

I fixed the memory leak while keping the feature actually functioning by delaying the cleanup.

I'm glad that we've found a way that works. Something doesn't smell right to me about this though...
The v8 inspector docs are quite light on what exactly these functions do (maybe worth perusing the v8 internal source)... but I would assume the Runtime.releaseObject simply releases the pointer the inspector is holding to the underlying object, to allow it to be garbage collected. Why would releasing that pointer cause the data we tried to save on the object to not be saved?
Surely awaiting on Runtime.callFunctionOn should wait for the state to fully clear/write to fully finish? I know that's not what we saw but something smells fishy here.

What do you think is going on? You probably have a better idea than I do.

@timfish
Copy link
Collaborator Author

timfish commented Aug 29, 2024

The v8 inspector docs are quite light on what exactly these functions do

Not only what the functions do, but how all the commands should be used together! You can get a reasonable idea by looking what others have done via Github search.

What do you think is going on? You probably have a better idea than I do.

Unfortunately not. Most that I've learnt has been through a lot of trial and error. The Chrome debugger has a protocol inspector so you can see the commands it's sending and learn a bit from that.

From what I've observed, Runtime.callFunctionOn modifies the object and then retains a reference to it which stops it from being GC'd. Runtime.releaseObject does release the reference but also appears to revert the changes. Without digging into the v8 debugger code, I don't really know why it does this or why you'd want it to work like this.

@Bruno-DaSilva
Copy link

Bruno-DaSilva commented Aug 30, 2024

Hey @timfish I just tried this branch in my local with and without delaying releaseObject and I get variables for both.

Can you double check on your end?

@timfish
Copy link
Collaborator Author

timfish commented Sep 1, 2024

I'll be back looking at this on Monday evening on my way back from vacation.

The e2e test failures in CI suggested that the delay was required but I'll double check it.

@timfish
Copy link
Collaborator Author

timfish commented Sep 4, 2024

I just tried this branch in my local with and without delaying releaseObject and I get variables for both

While we get local variables without the delay in releasing, we still get a memory leak. It looks like releaseObject needs to be called after the debugger has resumed to avoid the memory leak.

@Bruno-DaSilva
Copy link

I just tried this branch in my local with and without delaying releaseObject and I get variables for both

While we get local variables without the delay in releasing, we still get a memory leak. It looks like releaseObject needs to be called after the debugger has resumed to avoid the memory leak.

Nice find. That sounds incredibly annoying to have found. This is awesome though, it makes the whole local variables integration much simpler and much more robust 💪

@timfish
Copy link
Collaborator Author

timfish commented Sep 4, 2024

Plot twist...

While setImmediate is enough for Node v20, for v22 it releases too early and local variables are missing 😬

@timfish timfish marked this pull request as ready for review September 4, 2024 09:27
@Bruno-DaSilva
Copy link

Plot twist...

While setImmediate is enough for Node v20, for v22 it releases too early and local variables are missing 😬

Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow 😭

@Bruno-DaSilva
Copy link

Bruno-DaSilva commented Sep 4, 2024

You know, if it were easy to do bisecting to the exact v8 engine version that introduces that behaviour change would probably reveal a lot about the internals of nodejs/the inspector/object releasing.

I suspect the best way to do that is to checkout nodejs@ v22.x and bisect through deps/v8. The node v20.x is on 11.3.244.4 and node v22.x is on 12.4.254.14.

But uhh, that's pretty time consuming for a probably-useless-except-for-knowledge outcome lol. Maaaybe it'd be a bug in v8, which would be cool i guess 🤷 . Or maybe this is just undefined behaviour with no guarantees because their api docs don't specify anything >.<

More just saying this out loud to myself. Would be neat but won't pay the bills :'(

@timfish
Copy link
Collaborator Author

timfish commented Sep 4, 2024

Honestly, that is equal parts interesting and terrifying. Feels like we're relying on undefined behaviour somehow

Yeah, feels like this entire debug API is prone to race conditions.

The node v20.x is on 11.3.244.4 and node v22.x is on 12.4.254.14

Could try more node versions to find an exact version where it changes. If it ends up on a v8 upgrade, the changeset will be huge 😭

Maaaybe it'd be a bug in v8, which would be cool i guess 🤷

Even if we find a bug in v8, it's unlikely a fix would make it into an LTS Node version anyway. We'd need to cherry pick that fix and apply it to older v8 versions which might be too much work.

It could also be a bug in the inspector module although feels less likely.

@timfish timfish requested review from AbhiPrasad, Bruno-DaSilva and mydea and removed request for Bruno-DaSilva September 4, 2024 12:00
Copy link

@Bruno-DaSilva Bruno-DaSilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to the design wrt. the overall @sentry/js conventions, but from a conceptual point of view this LGTM. One comment about debug logging.

Comment on lines +56 to +67
function addLocalVariablesToEvent(event: Event, hint: EventHint): Event {
if (
hint.originalException &&
typeof hint.originalException === 'object' &&
LOCAL_VARIABLES_KEY in hint.originalException &&
Array.isArray(hint.originalException[LOCAL_VARIABLES_KEY])
) {
for (const exception of event.exception?.values || []) {
addLocalVariablesToException(exception, hint.originalException[LOCAL_VARIABLES_KEY]);
}

hint.originalException[LOCAL_VARIABLES_KEY] = undefined;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in both here and the worker, is it worth adding some kind of debug logging for when the sdk is in debug mode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can validate it pretty easily by inspecting the event in beforeSend, so I think we don't explicitly need logging for every event, but @timfish if you think it's a good idea go for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the discussion over whether some of our logging here should be behind debug: true or regular console.*. If we hit rate limiting maybe this should always be logged rather than only when debug is enabled? Users are unlikely to have debug logging enabled in production but it's likely useful to know when you're hitting rate limiting.

@timfish timfish requested review from AbhiPrasad and removed request for AbhiPrasad September 5, 2024 15:02
Copy link
Member

@AbhiPrasad AbhiPrasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally catching up after vacation, thanks for all the help @Bruno-DaSilva!! I'll start reading through all the issues and leaving comments now 😄.

Could try more node versions to find an exact version where it changes. If it ends up on a v8 upgrade, the changeset will be huge 😭

If we can get a minimal reproduction going we should write up a Node.js ticket, I can help ping people to take a look. I don't think we should investigate it further until we can get help from them.

}

// Only hash the 10 most recent frames (ie. the last 10)
return frames.slice(-10).reduce((acc, frame) => `${acc},${frame.function},${frame.lineno},${frame.colno}`, '');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should filter out system frames here, so basically either

  1. pick all the frames (if less than 10)
  2. pick 10 most recent frames (after filtering out system frames)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is for the sync debugger has simply been copied from ./common.ts as it's no longer common between the two implementations. I probably should have left it there! The sync debugger has seen no changes other than the moving of this code.

Comment on lines +56 to +67
function addLocalVariablesToEvent(event: Event, hint: EventHint): Event {
if (
hint.originalException &&
typeof hint.originalException === 'object' &&
LOCAL_VARIABLES_KEY in hint.originalException &&
Array.isArray(hint.originalException[LOCAL_VARIABLES_KEY])
) {
for (const exception of event.exception?.values || []) {
addLocalVariablesToException(exception, hint.originalException[LOCAL_VARIABLES_KEY]);
}

hint.originalException[LOCAL_VARIABLES_KEY] = undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can validate it pretty easily by inspecting the event in beforeSend, so I think we don't explicitly need logging for every event, but @timfish if you think it's a good idea go for it.

Comment on lines 120 to 124
await session.post('Runtime.callFunctionOn', {
functionDeclaration: `function() { this.${LOCAL_VARIABLES_KEY} = ${JSON.stringify(frames)}; }`,
silent: true,
objectId: errorObjectId,
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the happy middleground is some kind of dynamic backpressure detector, so we can adjust rate-limiting according to the volume of exceptions accordingly.

In our server-side SDKs (not node yet though), we do basic heuristic analysis to downsample outgoing events based on relative volume. Spec here: https://develop.sentry.dev/sdk/telemetry/traces/backpressure/. I assume we can take the larger general concept and apply it here.

Let's give this a try afterwards, we can scope it out further.

@AbhiPrasad
Copy link
Member

Gonna go ahead and merge this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LocalVariables integration - race condition processing of cached frames

3 participants