fix(opentelemetry): trace context propagation in process-pool workers #1017
Conversation
- Add test to show trace context is not available
- This test implementation isn't to be taken as a reference for production. The fixed `TracingInterceptor` works in production, provided you use the `OTLPSpanExporter` or another exporter that pushes traces to a collector or backend, rather than one that pulls traces from the server (if such a thing exists).
- Add a custom span exporter that writes finished spans to a list proxy created by the server process manager. This is because we want to test the full trace across the process pool. Again, in production, the child process can just export spans directly to a remote collector; tracing is designed to handle distributed systems.
- Ensure the child process is initialised with its own `TracerProvider` to avoid different default `mp_context` behaviours across macOS and Linux (a minimal sketch of such an initializer follows this list).
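For context, a minimal sketch (not the PR's test code; the console exporter is a stand-in assumption) of giving each pool worker its own `TracerProvider` via the executor's `initializer` hook:

```python
# Rough sketch: initialise a per-process TracerProvider in each pool worker.
# ConsoleSpanExporter is a placeholder; the tests described here use a custom
# exporter writing to a manager list proxy instead.
import concurrent.futures

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor


def _init_worker_tracing() -> None:
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)


executor = concurrent.futures.ProcessPoolExecutor(initializer=_init_worker_tracing)
```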
- For some reason, the docstring comparison for the reflection check seemed to fail in Python 3.9. I shortened the docstring to make it easier to compare in VSCode test output, which seemed to fix the test. Maybe 3.9 doesn't strip leading spaces in the docstring (e.g. like textwrap.dedent)?
@cretz or @dandavison might be best to review the OpenTelemetry changes
`return await super().execute_activity(input)`

`@dataclasses.dataclass`
`class ActivityFnWithTraceContext:`
So I understand what's happening here. Basically you have to make a picklable top-level class that can go over the process boundary, because unlike asyncio (which copies contextvars implicitly) and threaded (where we do a `copy_context` for you), there is no implicit context propagation across process boundaries.
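A minimal sketch of doing that propagation explicitly with the plain OpenTelemetry API (illustrative only, not SDK code; the function names are made up for the example):

```python
# Carry the OTel trace context across a process boundary by hand: inject it into
# a plain (picklable) dict in the parent, then extract and attach it in the child.
import concurrent.futures

from opentelemetry import context, propagate, trace


def child_task(carrier: dict) -> None:
    # Note: the child also needs its own TracerProvider configured for these
    # spans to be recorded/exported (see the commit note above).
    token = context.attach(propagate.extract(carrier))
    try:
        with trace.get_tracer(__name__).start_as_current_span("child-span"):
            pass  # child spans now parent correctly to the caller's span
    finally:
        context.detach(token)


def parent() -> None:
    with trace.get_tracer(__name__).start_as_current_span("parent-span"):
        carrier: dict = {}
        propagate.inject(carrier)  # serialise the current context into the dict
        with concurrent.futures.ProcessPoolExecutor() as pool:
            pool.submit(child_task, carrier).result()
```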
This may not be the last time a process pool user (or even thread pool user) wants to provide initialization code that runs inside the executor instead of outside, where the interceptor code runs. Also, there could be an option where one does this at the executor level. So here are the options as I see them:
1. Provide an additional, optional field on `ExecuteActivityInput` called `initializer: Optional[Callable]` that, if set, we run inside the executor before the actual activity function.
2. Inside the `opentelemetry` module here, provide some kind of `OpenTelemetryProcessPoolExecutor` that extends `ProcessPoolExecutor` and overrides `submit` to do basically what the interceptor is doing here (a rough sketch follows this list).
3. Do as this PR does (but with slight changes such as renaming this class to `_PicklableActivityWithTraceContext` to clarify its specific use and make it private, and adding an opt-out option to return to today's no-propagate behavior).
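A rough sketch of what option 2 could look like (illustrative only; the class and helper are assumptions, not an existing API):

```python
# Hypothetical executor that injects the submitting thread's OTel context into
# every task, then reattaches it inside the pool worker before running fn.
import concurrent.futures

from opentelemetry import context, propagate


def _run_with_context(carrier: dict, fn, /, *args, **kwargs):
    token = context.attach(propagate.extract(carrier))
    try:
        return fn(*args, **kwargs)
    finally:
        context.detach(token)


class OpenTelemetryProcessPoolExecutor(concurrent.futures.ProcessPoolExecutor):
    def submit(self, fn, /, *args, **kwargs):
        carrier: dict = {}
        propagate.inject(carrier)  # capture the current trace context
        return super().submit(_run_with_context, carrier, fn, *args, **kwargs)
```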
I can see all three options as reasonable, each with their own tradeoffs. I'm leaning towards 1 just so other interceptor implementers can also run code on the other side of the executor without wrapping user functions.
@dandavison or @tconley1428 - any additional thoughts?
After some discussion, we may not be able to confirm/adjust the design in the near term for priority reasons. But as a workaround in the meantime, you can make your own interceptor that runs after the OTel one and does this "copy span into process", at least until we can confer on this PR.
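A rough, untested sketch of that workaround (class names are illustrative; it assumes the SDK's `Interceptor` / `ActivityInboundInterceptor` / `ExecuteActivityInput` interfaces):

```python
# Extra interceptor, registered after the OTel TracingInterceptor, that captures
# the current trace context in the parent process and reattaches it in the worker.
import functools
from typing import Any

from opentelemetry import context as otel_context
from opentelemetry import propagate
from temporalio import worker


class _ContextCarryingActivity:
    """Picklable wrapper that reattaches the OTel context before calling fn."""

    def __init__(self, fn, carrier: dict) -> None:
        self._fn = fn
        self._carrier = carrier
        functools.update_wrapper(self, fn)  # keep __name__, __module__, etc.

    def __call__(self, *args: Any) -> Any:
        token = otel_context.attach(propagate.extract(self._carrier))
        try:
            return self._fn(*args)
        finally:
            otel_context.detach(token)


class ContextPropagationInterceptor(worker.Interceptor):
    """Register after TracingInterceptor so the activity span is already current."""

    def intercept_activity(
        self, next: worker.ActivityInboundInterceptor
    ) -> worker.ActivityInboundInterceptor:
        return _ContextPropagationActivityInbound(next)


class _ContextPropagationActivityInbound(worker.ActivityInboundInterceptor):
    async def execute_activity(self, input: worker.ExecuteActivityInput) -> Any:
        # In practice you would only wrap sync, process-pool activities.
        carrier: dict = {}
        propagate.inject(carrier)  # serialize the current trace context (picklable)
        input.fn = _ContextCarryingActivity(input.fn, carrier)
        return await super().execute_activity(input)
```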
Thanks for the feedback on this @cretz. Patching the behaviour with a second interceptor for the time being would be a bit easier for users than extending `TracingInterceptor`, like I did in my own project. I'll be sure to contribute it back to the samples repo for others to use. I hope to see this fixed here when priorities allow.

Option 1 sounds like the most correct approach to me too, as it avoids manipulating the user's function. But I think both options 1 and 2 would need to apply a sort of "reverse chain of responsibility" pattern, so multiple interceptors can stack together without overriding the previous interceptor's setup/cleanup logic. The pattern in this PR is extensible in this way, but the `functools.wraps` part would need to be clearly documented, maybe in the custom interceptor guide, if accepted.
What was changed
This PR fixes a bug in the OpenTelemetry `TracingInterceptor` affecting sync, multi-process activities. The fix ensures tracing capabilities are possible inside the user's activity implementation, e.g. creating child spans, trace events, log correlation, profile correlation, distributed tracing with other systems, etc.

Unlike async or sync multi-threaded activities, the `TracingInterceptor`/`_ActivityInboundImpl` interceptors had not propagated the OTEL trace context, in this case, across the process pool.

For the process-pool executor, any data we want to send to the child process must be extracted from contextvars and/or otherwise passed to `loop.run_in_executor` as picklable arguments to the target `_execute_sync_activity` function, and then rebuilt into contextvars on the other side.

Since the trace context is created in the `TracingInterceptor` in the parent process, it would be difficult to get it all the way down to `_ActivityInboundImpl`, where it can be sent to the child process, without introducing OpenTelemetry as a core dependency. This change attempts to be as transparent as possible, but may introduce a breaking change (see the end of this section).
The `TracingInterceptor`'s inbound activity interceptor now handles the special case of sync, non-thread-pool-executor activities. It wraps the `input.fn` in a picklable dataclass (sketched below) that:

- defines a `__call__` method that becomes the entrypoint of the subprocess task, which reattaches the trace context before delegating to the original activity function
- preserves the original function's attributes using `functools.wraps`. This is because downstream interceptors, such as the `SentryInterceptor` in the Python examples (see feat: add example using Sentry V2 SDK samples-python#140), use reflection on the activity attributes, e.g. `fn.__name__`, `fn.__module__`.
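A simplified sketch in the spirit of that wrapper (details differ from the actual implementation; the activity function is made up), showing why attribute preservation matters for reflection-based interceptors:

```python
# Illustration only: a picklable wrapper whose attributes mirror the wrapped
# activity function, so code inspecting fn.__name__ / fn.__module__ still works.
import dataclasses
import functools
from typing import Any, Callable


@dataclasses.dataclass
class ActivityFnWithTraceContext:
    fn: Callable[..., Any]
    carrier: dict  # serialized trace context, e.g. from opentelemetry.propagate.inject

    def __post_init__(self) -> None:
        # Copy __name__, __module__, __doc__, etc. from the wrapped function.
        functools.update_wrapper(self, self.fn)

    def __call__(self, *args: Any) -> Any:
        # The real implementation reattaches the trace context here first.
        return self.fn(*args)


def my_activity(x: int) -> int:
    return x + 1


wrapped = ActivityFnWithTraceContext(my_activity, {})
assert wrapped.__name__ == "my_activity"
assert wrapped.__module__ == my_activity.__module__
```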
Tests have been added to verify the fix, and I've had this patch running in our production environment for several weeks without any issues.
Breaking Change
As mentioned above, this change may break downstream interceptors that rely on receiving the original activity function handle directly.
Care has been taken to ensure common properties are preserved using `functools.wraps`, like you would with a decorator. However, without more significant changes to other parts of the SDK, I think this cannot be avoided, since a closure wrapping the function cannot be pickled.

Users would need to ensure any interceptor switches on the function name, `fn.__name__`, rather than on a reference to the real function.

Why?
Users of the SDK's process-pool Worker currently cannot leverage OTEL tracing capabilities inside their own activity implementation. The `TracingInterceptor` correctly instruments the activity's root span, but further downstream tracing is not properly linked to this parent span, so capabilities such as child spans, trace events, and log correlation are currently broken in sync, multi-process activities.

Checklist
Closes #669
How was this tested:
Added tests to verify the full trace across the process pool, including spans exported from the child process.
Manual testing using the OTEL logging SDK in my app shows that logs emitted in the activities are injected with the correct `trace_id`/`span_id`, enabling log correlation in Grafana/Tempo/Loki. I didn't want to add this to the tests as the logging SDK is still experimental.

Note: testing this was quite tricky. I used a proxy list in the server process manager to access the spans exported from the child process's `SpanExporter`. I don't expect this would ever be necessary in production code (especially with OpenTelemetry), since all of the OTEL tracing exporters that I've seen use a push-based approach to export spans directly to an OTEL collector or tracing backend. (I think I remember seeing a Jaeger guide that indicated scraping traces from an endpoint, but that was for native Jaeger tooling, I think.) With a push-based exporter, e.g. `OTLPSpanExporter`, the child process can simply export its spans without needing to consolidate them with the parent process, even while the parent span created in the `TracingInterceptor` is yet to complete and be exported; tracing backends expect to receive spans out of order.
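For anyone curious, a minimal sketch of that kind of test-only exporter (not the PR's actual test code; the class name and the tuple format are illustrative):

```python
# Test helper: a SpanExporter that appends finished spans to a multiprocessing
# manager list proxy, so the parent-process test can assert on spans exported
# from pool workers. Only picklable primitives are appended.
import multiprocessing
from typing import Sequence

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class ListProxySpanExporter(SpanExporter):
    def __init__(self, finished_spans) -> None:
        # finished_spans is expected to be a multiprocessing.Manager().list()
        self._finished_spans = finished_spans

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        for span in spans:
            ctx = span.context
            self._finished_spans.append((span.name, ctx.trace_id, ctx.span_id))
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass


# Usage in a test:
manager = multiprocessing.Manager()
finished_spans = manager.list()
exporter = ListProxySpanExporter(finished_spans)
```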
Any docs updates needed?
Hopefully, no change from users is necessary.