
Conversation

gregbrowndev

What was changed

This PR fixes a bug in the OpenTelemetry TracingInterceptor affecting sync, multi-process activities. The fix makes tracing capabilities available inside the user's activity implementation, e.g. creating child spans, attaching trace events, log correlation, profile correlation, and distributed tracing with other systems.

Unlike for async or sync multi-threaded activities, the TracingInterceptor/_ActivityInboundImpl interceptors did not propagate the OTEL trace context across the process pool in this case.

Note: Both async and threadpool executors manage the trace context via Python's contextvars.

For the process-pool executor, any data we want to send to the child process must be extracted from contextvars and/or otherwise passed to loop.run_in_executor as picklable arguments to the target _execute_sync_activity function, and then rebuilt into contextvars on the other side.

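To make that constraint concrete (this is an illustrative sketch, not SDK code; all names are invented for the example), values held in contextvars in the parent process have to be captured explicitly, shipped as picklable arguments, and restored in the child process:

```python
import asyncio
import contextvars
from concurrent.futures import ProcessPoolExecutor

request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id")


def _child_entrypoint(captured_request_id: str) -> str:
    # Rebuild the context inside the child process before doing the real work.
    request_id.set(captured_request_id)
    return f"handled {request_id.get()}"


async def main() -> None:
    request_id.set("abc-123")
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # contextvars are not propagated across the process boundary, so the
        # value has to travel as a plain, picklable argument.
        result = await loop.run_in_executor(pool, _child_entrypoint, request_id.get())
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```
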
Since the trace context is created in the TracingInterceptor in the parent process, it would be difficult to get it all the way down to _ActivityInboundImpl, where it could be sent to the child process, without introducing OpenTelemetry as a core dependency. This change attempts to be as transparent as possible, but may introduce a breaking change (see the Breaking Change section below).

The TracingInterceptor's inbound activity interceptor now handles the special case for sync, non-threadpool executor activities. It wraps the input.fn in a picklable dataclass (a minimal sketch of this wrapper follows the list) that:

  • captures the trace context for the parent span created in the interceptor / parent process
  • exposes its own __call__ function that becomes the entrypoint of the subprocess task, which reattaches the trace context before delegating to the original activity function
  • preserves as much of the original activity function's metadata as possible, using functools.wraps. This is needed because downstream interceptors, such as the SentryInterceptor in the Python examples (see feat: add example using Sentry V2 SDK samples-python#140), reflect on activity attributes, e.g. fn.__name__ and fn.__module__.

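A minimal sketch of that wrapper, assuming the standard OpenTelemetry propagation API (opentelemetry.propagate / opentelemetry.context); the class and helper names here are illustrative rather than the exact code in this PR:

```python
import dataclasses
import functools
from typing import Any, Callable, Dict, Mapping

from opentelemetry import context, propagate


@dataclasses.dataclass
class _ActivityFnWithTraceContext:
    fn: Callable[..., Any]
    carrier: Mapping[str, str]  # serialized trace context, injected in the parent process

    def __post_init__(self) -> None:
        # Copy __name__, __module__, __doc__, etc. from the original activity so
        # downstream interceptors that reflect on the function still work.
        functools.wraps(self.fn)(self)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Runs inside the child process: reattach the trace context before
        # delegating to the user's activity function.
        token = context.attach(propagate.extract(self.carrier))
        try:
            return self.fn(*args, **kwargs)
        finally:
            context.detach(token)


def _wrap_for_process_pool(fn: Callable[..., Any]) -> _ActivityFnWithTraceContext:
    # Called in the parent process (inside the interceptor): capture the current
    # trace context into a plain dict that can be pickled with the wrapper.
    carrier: Dict[str, str] = {}
    propagate.inject(carrier)
    return _ActivityFnWithTraceContext(fn=fn, carrier=carrier)
```
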
Tests have been added to verify the fix and I've had this patch running in our production environment for several weeks without any issues.

Breaking Change

As mentioned above, this change may break downstream interceptors that rely on receiving the original activity function handle directly.

Care has been taken to preserve common properties using functools.wraps, as you would with a decorator. However, without more significant changes to other parts of the SDK, I think this cannot be avoided, since a closure wrapping the activity function cannot be pickled.

Users would need to ensure any such interceptor switches on the function name, fn.__name__, rather than on a reference to the real function.

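For example, a downstream interceptor that previously compared against the function object could switch to a name-based check (my_activity is a placeholder here):

```python
def is_my_activity(fn) -> bool:
    # Before: identity comparison, which breaks once the activity is wrapped.
    #   return fn is my_activity
    # After: name comparison, which functools.wraps preserves on the wrapper.
    return getattr(fn, "__name__", None) == "my_activity"
```
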
Why?

Users of the SDK's process-pool Worker currently cannot leverage OTEL tracing capabilities inside their own activity implementation. The TracingInterceptor correctly instruments the activity's root span, but further downstream tracing is not properly linked to this parent span. The following is currently broken in sync, multiprocess activities (a short example follows the list):

  • creating child spans
  • attaching trace events / attributes
  • log correlation (e.g. with experimental OTEL logging SDK)
  • profile correlation (e.g. with Grafana Pyroscope SDK)
  • propagating the trace context onwards for distributed tracing with other systems

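For reference, this is the kind of code the fix enables inside a sync, process-pool activity (a hedged sketch; the tracer name, span name, and attributes are arbitrary). With the trace context propagated, the span below is parented to the activity span created by the TracingInterceptor in the parent process:

```python
from opentelemetry import trace
from temporalio import activity

tracer = trace.get_tracer(__name__)


@activity.defn
def process_item(item: str) -> str:
    # Now a child of the activity span created by TracingInterceptor.
    with tracer.start_as_current_span("process-item") as span:
        span.set_attribute("item", item)
        span.add_event("started processing")
        return item.upper()
```
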
Checklist

Closes: 669

How was this tested:

Added tests to verify:

  • child spans in the activity have the correct parent
  • the wrapped activity preserves the original function's metadata

Manual testing using the OTEL logging SDK in my app shows that logs emitted in the activities are injected with the correct trace_id/span_id, enabling log correlation in Grafana/Tempo/Loki. I didn't want to add this to the tests as the logging SDK is still experimental.

Note: testing this was quite tricky. I used a proxy list in the server process manager to access the spans exported from the child process's SpanExporter. I don't expect this would ever be necessary in production code (especially with OpenTelemetry), since all of the OTEL tracing exporters I've seen use a push-based approach to export spans directly to an OTEL collector or tracing backend. (I think I remember seeing a Jaeger guide that indicated scraping traces from an endpoint, but that was for native Jaeger tooling, I think.) With a push-based exporter, e.g. OTLPSpanExporter, the child process can simply export its spans without needing to consolidate them with the parent process; even while the parent span created in the TracingInterceptor has yet to complete and be exported, tracing backends expect to receive spans out of order.

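For anyone curious, the test technique described above looks roughly like the sketch below (test-only, names illustrative): a custom SpanExporter appends minimal, picklable span data to a multiprocessing manager list proxy so the parent test process can assert on spans finished inside the pool's child processes.

```python
from typing import Optional, Sequence, Tuple

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

# (name, trace_id, span_id, parent_span_id)
SpanRecord = Tuple[str, int, int, Optional[int]]


class ListProxySpanExporter(SpanExporter):
    """Appends finished spans to a list proxy shared with the parent process."""

    def __init__(self, finished_spans) -> None:
        # finished_spans is a multiprocessing.Manager().list() proxy
        self._finished_spans = finished_spans

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        # Send only small, picklable tuples across the manager connection,
        # not the span objects themselves.
        self._finished_spans.extend(
            [
                (
                    span.name,
                    span.context.trace_id,
                    span.context.span_id,
                    span.parent.span_id if span.parent else None,
                )
                for span in spans
            ]
        )
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass
```
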
Any docs updates needed?

Hopefully, no change from users is necessary.

gregbrowndev and others added 5 commits July 20, 2025 16:10
- Add test to show trace context is not available
- This test implementation isn't to be taken as a reference for production.
  The fixed `TracingInterceptor` works in production, provided you use
  the `OTLPSpanExporter` or other exporter that pushes traces to a collector
  or backend, rather than one that pulls traces from the server (if one exists).
- Add a custom span exporter to write finished_spans to a list proxy
  created by the server process manager. This is because we want to test the
  full trace across the process pool. Again, in production, the child process
  can just export spans directly to a remote collector. Tracing is designed
  to handle distributed systems.
- Ensure the child process is initialised with its own TracerProvider to
  avoid different default mp_context behaviours across MacOS and Linux
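
The last commit above touches on a related detail: with the spawn start method (the default on macOS), each child process starts without any tracer configuration, so it needs its own TracerProvider. A hedged sketch of one way to do that with a pool initializer (the exporter choice is an assumption, and passing the pool as the Worker's activity_executor is how I'd expect it to be wired up):

```python
from concurrent.futures import ProcessPoolExecutor

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def _init_child_tracing() -> None:
    # Runs once in every worker process; spawned children start with no
    # tracer provider configured, so set one up here.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)


# Pass this executor to the Worker as its activity executor.
activity_executor = ProcessPoolExecutor(initializer=_init_child_tracing)
```
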
@gregbrowndev gregbrowndev requested a review from a team as a code owner August 3, 2025 22:45
@CLAassistant

CLAassistant commented Aug 3, 2025

CLA assistant check
All committers have signed the CLA.

@gregbrowndev gregbrowndev changed the title Fix/opentelemetry trace context propagation fix(opentelemetry): trace context propagation in process-pool workers Aug 3, 2025
- For some reason, the docstring comparison for the reflection check seemed to fail in Python 3.9
- I shortened the docstring to make it easier to compare in VSCode test output, that seemed to fix
  the test. Maybe 3.9 doesn't strip leading spaces in the docstring (e.g. like textwrap.dedent)?
@gregbrowndev
Author

@cretz or @dandavison might be best to review the OpenTelemetry changes

return await super().execute_activity(input)


@dataclasses.dataclass
class ActivityFnWithTraceContext:
Member

@cretz cretz Aug 20, 2025


So I understand what's happening here. Basically you have to make a picklable top-level class that can go over the process boundary, because unlike asyncio (which copies contextvars implicitly) and threaded activities (for which we do a copy_context for you), there is no implicit context propagation across process boundaries.

This may not be the last time a process pool user (or even thread pool user) wants to provide initialization code that runs inside the executor instead of outside, where the interceptor code runs. Also, there could be an option where one does this at the executor level. So here are the options as I see them:

  1. Provide an additional, optional field on ExecuteActivityInput called initializer: Optional[Callable] that, if set, we run inside the executor before the actual activity function
  2. Inside the opentelemetry module here, provide some kind of OpenTelemetryProcessPoolExecutor that extends ProcessPoolExecutor and overrides submit to do basically what the interceptor is doing here
  3. Do as this PR does (but with slight changes, such as renaming this class to _PicklableActivityWithTraceContext to clarify its specific use and make it private, and adding an opt-out option to return to today's no-propagate behavior)

I can see all three options as reasonable but having their own tradeoffs. I'm leaning towards 1 just so other interceptor implementers can also run code on the other side of the executor without wrapping user functions.

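A very rough sketch of what option 1 could look like, purely to illustrate the shape (the initializer field and the simplified stand-in for ExecuteActivityInput are assumptions, not existing SDK API):

```python
import dataclasses
from typing import Any, Callable, Mapping, Optional, Sequence


@dataclasses.dataclass
class ExecuteActivityInput:  # simplified stand-in for the real SDK dataclass
    fn: Callable[..., Any]
    args: Sequence[Any]
    headers: Mapping[str, Any]
    # New, hypothetical field: interceptors could set this, and the worker would
    # call it inside the executor (child process or thread) before running fn.
    initializer: Optional[Callable[[], None]] = None


def _execute_sync_activity(input: ExecuteActivityInput) -> Any:
    # Inside the executor: run interceptor-provided setup first, then the activity.
    if input.initializer is not None:
        input.initializer()
    return input.fn(*input.args)
```
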
@dandavison or @tconley1428 - any additional thoughts?

Member


After some discussion, we may not be able to confirm/adjust the design in the near term for priority reasons. But as a workaround in the meantime, you can make your own interceptor that runs after the OTel one and does this "copy span into process", at least until we can confer on this PR.

Author


Thanks for the feedback on this @cretz. Patching the behaviour with a second interceptor for the time being would be a bit easier for users than extending TracingInterceptor, like I did in my own project. I'll be sure to contribute it back to the samples repo for others to use. I hope to see this fixed here when priorities allow.

Option 1 sounds like the most correct approach to me too, as it avoids manipulating the user's function. But I think both options 1 and 2 would need to apply a sort of "reverse chain of responsibility" pattern, so multiple interceptors can stack together without overriding the previous interceptor's setup/cleanup logic. The pattern in this PR is extensible in this way, but the functools.wraps part would need to be clearly documented, maybe in the custom interceptor guide, if accepted.

Successfully merging this pull request may close these issues.

[Feature Request] Support / provide guidance on using OpenTelemetry logging + metrics SDKs with process-pool workers