fix(opentelemetry): trace context propagation in process-pool workers #1017
Conversation
- Add test to show trace context is not available
- This test implementation isn't to be taken as a reference for production. The fixed `TracingInterceptor` works in production, provided you use the `OTLPSpanExporter` or another exporter that pushes traces to a collector or backend, rather than one that pulls traces from the server (if such a thing exists).
- Add a custom span exporter that writes finished spans to a list proxy created by the server process manager. This is because we want to test the full trace across the process pool. Again, in production, the child process can just export spans directly to a remote collector; tracing is designed to handle distributed systems.
- Ensure the child process is initialised with its own `TracerProvider` to avoid different default `mp_context` behaviours across macOS and Linux (a minimal sketch of such an initializer follows this list).
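For context, a minimal sketch (not the PR's test code; the console exporter is a stand-in assumption) of giving each pool worker its own `TracerProvider` via the executor's `initializer` hook:

```python
# Rough sketch: initialise a per-process TracerProvider in each pool worker.
# ConsoleSpanExporter is a placeholder; the tests described here use a custom
# exporter writing to a manager list proxy instead.
import concurrent.futures

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor


def _init_worker_tracing() -> None:
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)


executor = concurrent.futures.ProcessPoolExecutor(initializer=_init_worker_tracing)
```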
- For some reason, the docstring comparison for the reflection check seemed to fail in Python 3.9. I shortened the docstring to make it easier to compare in VSCode test output, which seemed to fix the test. Maybe 3.9 doesn't strip leading spaces in the docstring (e.g. like textwrap.dedent)?
@cretz or @dandavison might be best to review the OpenTelemetry changes
`return await super().execute_activity(input)`

`@dataclasses.dataclass`
`class ActivityFnWithTraceContext:`
So I understand what's happening here. Basically you have to make a picklable top-level class that can go over the process boundary, because unlike asyncio (which copies contextvars implicitly) and threaded (where we do a `copy_context` for you), there is no implicit context propagation across process boundaries.
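A minimal sketch of doing that propagation explicitly with the plain OpenTelemetry API (illustrative only, not SDK code; the function names are made up for the example):

```python
# Carry the OTel trace context across a process boundary by hand: inject it into
# a plain (picklable) dict in the parent, then extract and attach it in the child.
import concurrent.futures

from opentelemetry import context, propagate, trace


def child_task(carrier: dict) -> None:
    # Note: the child also needs its own TracerProvider configured for these
    # spans to be recorded/exported (see the commit note above).
    token = context.attach(propagate.extract(carrier))
    try:
        with trace.get_tracer(__name__).start_as_current_span("child-span"):
            pass  # child spans now parent correctly to the caller's span
    finally:
        context.detach(token)


def parent() -> None:
    with trace.get_tracer(__name__).start_as_current_span("parent-span"):
        carrier: dict = {}
        propagate.inject(carrier)  # serialise the current context into the dict
        with concurrent.futures.ProcessPoolExecutor() as pool:
            pool.submit(child_task, carrier).result()
```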
This may not be the last time a process pool user (or even thread pool user) wants to provide initialization code that runs inside the executor instead of outside, where the interceptor code runs. Also, there could be an option where one does this at the executor level. So here are the options as I see them:
1. Provide an additional, optional field on `ExecuteActivityInput` called `initializer: Optional[Callable]` that, if set, we run inside the executor before the actual activity function.
2. Inside the `opentelemetry` module here, provide some kind of `OpenTelemetryProcessPoolExecutor` that extends `ProcessPoolExecutor` and overrides `submit` to do basically what the interceptor is doing here (a rough sketch follows this list).
3. Do as this PR does (but with slight changes such as renaming this class to `_PicklableActivityWithTraceContext` to clarify its specific use and make it private, and adding an opt-out option to return to today's no-propagate behavior).
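A rough sketch of what option 2 could look like (illustrative only; the class and helper are assumptions, not an existing API):

```python
# Hypothetical executor that injects the submitting thread's OTel context into
# every task, then reattaches it inside the pool worker before running fn.
import concurrent.futures

from opentelemetry import context, propagate


def _run_with_context(carrier: dict, fn, /, *args, **kwargs):
    token = context.attach(propagate.extract(carrier))
    try:
        return fn(*args, **kwargs)
    finally:
        context.detach(token)


class OpenTelemetryProcessPoolExecutor(concurrent.futures.ProcessPoolExecutor):
    def submit(self, fn, /, *args, **kwargs):
        carrier: dict = {}
        propagate.inject(carrier)  # capture the current trace context
        return super().submit(_run_with_context, carrier, fn, *args, **kwargs)
```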
I can see all three options as reasonable, each with their own tradeoffs. I'm leaning towards 1 just so other interceptor implementers can also run code on the other side of the executor without wrapping user functions.
@dandavison or @tconley1428 - any additional thoughts?
After some discussion, we may not be able to confirm/adjust the design in the near term for priority reasons. But as a workaround in the meantime, you can make your own interceptor that runs after the OTel one and does this "copy span into process", at least until we can confer on this PR.
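A rough, untested sketch of that workaround (class names are illustrative; it assumes the SDK's `Interceptor` / `ActivityInboundInterceptor` / `ExecuteActivityInput` interfaces):

```python
# Extra interceptor, registered after the OTel TracingInterceptor, that captures
# the current trace context in the parent process and reattaches it in the worker.
import functools
from typing import Any

from opentelemetry import context as otel_context
from opentelemetry import propagate
from temporalio import worker


class _ContextCarryingActivity:
    """Picklable wrapper that reattaches the OTel context before calling fn."""

    def __init__(self, fn, carrier: dict) -> None:
        self._fn = fn
        self._carrier = carrier
        functools.update_wrapper(self, fn)  # keep __name__, __module__, etc.

    def __call__(self, *args: Any) -> Any:
        token = otel_context.attach(propagate.extract(self._carrier))
        try:
            return self._fn(*args)
        finally:
            otel_context.detach(token)


class ContextPropagationInterceptor(worker.Interceptor):
    """Register after TracingInterceptor so the activity span is already current."""

    def intercept_activity(
        self, next: worker.ActivityInboundInterceptor
    ) -> worker.ActivityInboundInterceptor:
        return _ContextPropagationActivityInbound(next)


class _ContextPropagationActivityInbound(worker.ActivityInboundInterceptor):
    async def execute_activity(self, input: worker.ExecuteActivityInput) -> Any:
        # In practice you would only wrap sync, process-pool activities.
        carrier: dict = {}
        propagate.inject(carrier)  # serialize the current trace context (picklable)
        input.fn = _ContextCarryingActivity(input.fn, carrier)
        return await super().execute_activity(input)
```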
Thanks for the feedback on this @cretz. Patching the behaviour with a second interceptor for the time being would be a bit easier for users than extending `TracingInterceptor`, like I did in my own project. I'll be sure to contribute it back to the samples repo for others to use. I hope to see this fixed here when priorities allow.

Option 1 sounds like the most correct approach to me too, as it avoids manipulating the user's function. But I think both options 1 and 2 would need to apply a sort of "reverse chain of responsibility" pattern, so multiple interceptors can stack together without overriding the previous interceptor's setup/cleanup logic. The pattern in this PR is extensible in this way, but the `functools.wraps` part would need to be clearly documented, maybe in the custom interceptor guide, if accepted.
What was changed
This PR fixes a bug in the OpenTelemetry `TracingInterceptor` affecting sync, multi-process activities. The fix ensures tracing capabilities are possible inside the user's activity implementation, e.g. creating child spans, trace events, log correlation, profile correlation, distributed tracing with other systems, etc.

Unlike async or sync multi-threaded activities, the `TracingInterceptor`/`_ActivityInboundImpl` interceptors had not propagated the OTEL trace context, in this case, across the process pool.

For the process-pool executor, any data we want to send to the child process must be extracted from contextvars and/or otherwise passed to `loop.run_in_executor` as picklable arguments to the target `_execute_sync_activity` function, and then rebuilt into contextvars on the other side.

Since the trace context is created in the `TracingInterceptor` in the parent process, it would be difficult to get it all the way down to `_ActivityInboundImpl`, where it can be sent to the child process, without introducing OpenTelemetry as a core dependency. This change attempts to be as transparent as possible, but may introduce a breaking change (see the end of this section).
The `TracingInterceptor`'s inbound activity interceptor now handles the special case of sync, non-thread-pool-executor activities. It wraps the `input.fn` in a picklable dataclass (sketched below) that:

- defines a `__call__` method that becomes the entrypoint of the subprocess task, which reattaches the trace context before delegating to the original activity function
- preserves the original function's attributes using `functools.wraps`. This is because downstream interceptors, such as the `SentryInterceptor` in the Python examples (see feat: add example using Sentry V2 SDK samples-python#140), use reflection on the activity attributes, e.g. `fn.__name__`, `fn.__module__`.
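A simplified sketch in the spirit of that wrapper (details differ from the actual implementation; the activity function is made up), showing why attribute preservation matters for reflection-based interceptors:

```python
# Illustration only: a picklable wrapper whose attributes mirror the wrapped
# activity function, so code inspecting fn.__name__ / fn.__module__ still works.
import dataclasses
import functools
from typing import Any, Callable


@dataclasses.dataclass
class ActivityFnWithTraceContext:
    fn: Callable[..., Any]
    carrier: dict  # serialized trace context, e.g. from opentelemetry.propagate.inject

    def __post_init__(self) -> None:
        # Copy __name__, __module__, __doc__, etc. from the wrapped function.
        functools.update_wrapper(self, self.fn)

    def __call__(self, *args: Any) -> Any:
        # The real implementation reattaches the trace context here first.
        return self.fn(*args)


def my_activity(x: int) -> int:
    return x + 1


wrapped = ActivityFnWithTraceContext(my_activity, {})
assert wrapped.__name__ == "my_activity"
assert wrapped.__module__ == my_activity.__module__
```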
Tests have been added to verify the fix, and I've had this patch running in our production environment for several weeks without any issues.
Breaking Change
As mentioned above, this change may break downstream interceptors that rely on receiving the original activity function handle directly.
Care has been taken to ensure common properties are preserved using `functools.wraps`, like you would with a decorator. However, without more significant changes to other parts of the SDK, I think this cannot be avoided, since a closure wrapping the function cannot be pickled.

Users would need to ensure any interceptor switches on the function name, `fn.__name__`, rather than on a reference to the real function.

Why?
Users of the SDK's process-pool Worker currently cannot leverage OTEL tracing capabilities inside their own activity implementation. The `TracingInterceptor` correctly instruments the activity's root span, but further downstream tracing is not properly linked to this parent span, so capabilities such as child spans, trace events, and log correlation are currently broken in sync, multi-process activities.

Checklist
Closes #669
How was this tested:
Added tests to verify the full trace across the process pool, including spans exported from the child process.
Manual testing using the OTEL logging SDK in my app shows that logs emitted in the activities are injected with the correct `trace_id`/`span_id`, enabling log correlation in Grafana/Tempo/Loki. I didn't want to add this to the tests as the logging SDK is still experimental.

Note: testing this was quite tricky. I used a proxy list in the server process manager to access the spans exported from the child process's `SpanExporter`. I don't expect this would ever be necessary in production code (especially with OpenTelemetry), since all of the OTEL tracing exporters that I've seen use a push-based approach to export spans directly to an OTEL collector or tracing backend. (I think I remember seeing a Jaeger guide that indicated scraping traces from an endpoint, but that was for native Jaeger tooling, I think.) With a push-based exporter, e.g. `OTLPSpanExporter`, the child process can simply export its spans without needing to consolidate them with the parent process, even while the parent span created in the `TracingInterceptor` is yet to complete and be exported; tracing backends expect to receive spans out of order.
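For anyone curious, a minimal sketch of that kind of test-only exporter (not the PR's actual test code; the class name and the tuple format are illustrative):

```python
# Test helper: a SpanExporter that appends finished spans to a multiprocessing
# manager list proxy, so the parent-process test can assert on spans exported
# from pool workers. Only picklable primitives are appended.
import multiprocessing
from typing import Sequence

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class ListProxySpanExporter(SpanExporter):
    def __init__(self, finished_spans) -> None:
        # finished_spans is expected to be a multiprocessing.Manager().list()
        self._finished_spans = finished_spans

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        for span in spans:
            ctx = span.context
            self._finished_spans.append((span.name, ctx.trace_id, ctx.span_id))
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass


# Usage in a test:
manager = multiprocessing.Manager()
finished_spans = manager.list()
exporter = ListProxySpanExporter(finished_spans)
```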
Any docs updates needed?
Hopefully, no change from users is necessary.