Skip to content

Conversation

@andrewfayres
Copy link
Collaborator

Adding timeouts to the clients. We can expand this idea for the service to service side as we want but I wanted to go ahead and get this part moving.

@Lance-Drane
Copy link
Collaborator

@andrewfayres could you make the target branch candidate-0.9.0 instead of main?

@andrewfayres andrewfayres changed the base branch from main to candidate-0.9.0 October 3, 2025 15:19
Copy link
Collaborator

@Lance-Drane Lance-Drane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit to reflect the changes of the "UserspaceMessage" format, which fixed two of the unit tests for me. However, one of the new unit tests is still failing for me, and the E2E tests are also failing.

Additionally, I thought we also wanted the Service to potentially "fail fast" in the event that the message is being processed after a specific point in time? This work is just to stop the Service from having to process a long-term message, and I think its most important usage is for immediately acknowledging and discarding the message without going through processing.

Otherwise, I think the unit tests themselves do a good job for capturing requirements, and this should at least cover the Client side well, so once all the tests pass I'll go ahead and merge this in.

# Wait to make sure timeout doesn't fire
time.sleep(0.7)

assert len(timeout_called) == 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this block of code, I'm currently getting len(timeout_called) == 1, so the function is being called anyways.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being caused by the line: here. Basically, because the test sets the flag _terminate_after_initial_messages, we return early and never really process the message. Looking at the code, there's a few other scenarios that could cause this. When you're back from vacation I think we should have a quick discussion about what we want the behavior to be in these cases. Currently, I lean towards putting the logic to remove the calls from the timeout check at the top before all of the fast exits.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I lean towards putting the logic to remove the calls from the timeout check at the top before all of the fast exits.

This is probably fine. I consider the fast exit usecase itself as somewhat niche (and one which we wouldn't attempt to replicate on the general-purpose orchestrator, which is what most people will end up using), and also one which shouldn't be setting timeouts on any of its own calls, so doing the timeout logic before the fast exit logic shouldn't cause any problems. Realistically I think the tests should probably not even use fast exit and should probably just use simple callbacks, the callbacks themselves could probably just assert that the message is or is not an error.

Comment on lines +293 to +300
# If not in pending requests, it already timed out, so ignore this response
if headers.operation_id in self._pending_requests:
del self._pending_requests[headers.operation_id]
else:
logger.debug(
f'Received response for operation {headers.operation_id} that already timed out, ignoring'
)
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The E2E tests are failing and I believe it's partially related to this block of code, because the existing E2E tests do not explicitly specify a timeout. self._pending_requests is not necessarily the source of truth for requests which haven't timed out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right. In addition to it being an issue for when a timeout wasn't set, it was an issue for when a client gets restarted. I've changed the logic to account for this. Will push everything after we discuss the above.

Comment on lines +488 to +494
if params.timeout is not None and params.on_timeout is not None:
self._pending_requests[params.operation].append(
{
'timeout': time.time() + params.timeout,
'on_timeout': params.on_timeout,
}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another part of the reason I think the E2E tests are failing. If the user doesn't specify a timeout on a pending request, then by default the request should be handled.

I also don't think that params.on_timeout should explicitly be checked for validity purposes, it should just be an optional function called in the specific usecase a user's function times out. A function should be allowed to timeout even if this callback isn't specified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Fixed and will push once we have a chance to discuss the first issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants