fix(runners): Add retry logic with exponential backoff for stale session errors #3586
base: main
Conversation
fix(runners): Add retry logic with backoff for stale session errors

- Add _append_event_with_retry method that retries up to 5 times with linear backoff (0.5s, 1s, 1.5s, 2s, 2.5s)
- Implement 90-second total timeout to prevent indefinite retries
- Refresh session from storage before each retry to ensure latest state
- Update _exec_with_plugin to use retry logic for event appending
- Update _append_new_message_to_session to use retry logic
- Add time import for timeout calculations

This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.
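As a rough illustration of the approach described above, here is a minimal, self-contained sketch (not the PR's actual code): `append_fn` and `refresh_fn` are hypothetical stand-ins for the session service's append/get calls, the stale-session error is assumed to surface as a ValueError whose message mentions "stale session", and this sketch raises on exhaustion rather than returning the original session.

```python
import asyncio
import time


async def append_with_retry(append_fn, refresh_fn, session, event,
                            max_retries=5, base_backoff=0.5,
                            total_timeout=90.0):
  """Retries append_fn on stale-session errors with linear backoff.

  append_fn(session, event) and refresh_fn(session) are hypothetical
  awaitables standing in for the session service's append/get calls.
  """
  start_time = time.monotonic()
  last_error = None
  for attempt in range(max_retries):
    try:
      await append_fn(session, event)
      return session
    except ValueError as e:
      if 'stale session' not in str(e).lower():
        raise  # Only stale-session errors are retried.
      last_error = e
      elapsed = time.monotonic() - start_time
      backoff = base_backoff * (attempt + 1)  # 0.5s, 1s, 1.5s, 2s, 2.5s
      if elapsed + backoff >= total_timeout:
        raise ValueError(
            f'Gave up after {attempt + 1} attempts and {elapsed:.1f}s'
        ) from e
      await asyncio.sleep(backoff)
      # Re-fetch the latest session state before the next attempt.
      refreshed = await refresh_fn(session)
      if refreshed is not None:
        session = refreshed
  raise ValueError(
      f'Failed to append event after {max_retries} attempts'
  ) from last_error
```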
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello @BaudhikMalik, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Runner's robustness by introducing a sophisticated retry mechanism for 'stale session' errors. Previously, concurrent updates could lead to immediate failures. The new logic automatically retries event appending with linear backoff and session refreshing, ensuring that the system can gracefully handle race conditions and temporary inconsistencies in session state, thereby improving overall stability and user experience in concurrent environments.

Highlights
Response from ADK Triaging Agent

Hello @BaudhikMalik, thank you for creating this PR! To help reviewers evaluate your contribution, could you please provide the following information as outlined in our contribution guidelines?

Additionally, it appears the Contributor License Agreement (CLA) is not signed. Please ensure you have signed the CLA so we can proceed with the review. This information will help us to review your PR more efficiently. Thanks!
Code Review
This pull request introduces retry logic with backoff for handling stale session errors, which is a valuable addition for improving the runner's resilience in concurrent environments. The implementation correctly identifies stale session errors and attempts to recover by refreshing the session. However, I've identified several critical issues with the current implementation. The retry logic silently fails on timeout or after exhausting all retries, which can lead to data loss. There's also a bug where an updated session object from a successful retry is ignored in one of the call sites. Additionally, there are opportunities to make the error handling and retry strategy more robust. My review includes suggestions to address these points.
```python
if elapsed_time >= total_timeout:
  logger.warning(
      'Failed to append event after %d attempts and %.1f seconds: %s',
      attempt + 1,
      elapsed_time,
      error_message,
  )
  # Return the original session on timeout
  return session
```
The method returns the original session on timeout, which is a silent failure. This can lead to data loss, as the caller is not notified that the event was not appended. This also contradicts the docstring, which states a ValueError should be raised on failure. The same issue exists when retries are exhausted (lines 379-384) or when the backoff would exceed the timeout (lines 347-352).
An exception should be raised in all failure cases to properly signal that the operation did not complete.
Suggested change:

```diff
 if elapsed_time >= total_timeout:
-  logger.warning(
-      'Failed to append event after %d attempts and %.1f seconds: %s',
-      attempt + 1,
-      elapsed_time,
-      error_message,
-  )
-  # Return the original session on timeout
-  return session
+  timeout_message = (
+      f'Failed to append event after {attempt + 1} attempts and'
+      f' {elapsed_time:.1f} seconds due to total timeout: {error_message}'
+  )
+  logger.warning(timeout_message)
+  raise ValueError(timeout_message) from e
```
```python
# All retries exhausted
logger.warning(
    'Failed to append event after %d retry attempts, returning original'
    ' session',
    max_retries,
)
return session
```
When all retries are exhausted, the method returns the original session, which is a silent failure. This contradicts the docstring and can lead to data loss. An exception should be raised to signal failure to the caller.
To provide better context, you could capture the last exception from the except block in a variable and then raise a new ValueError from it after the loop finishes.
```python
# All retries exhausted
final_error_message = (
    f'Failed to append event after {max_retries} retry attempts.'
)
logger.error(final_error_message)
raise ValueError(final_error_message)
```
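The snippet above raises a new ValueError without chaining; following the comment's idea of capturing the last exception, a hedged fragment (assuming a hypothetical `last_exception` variable assigned in the retry loop's `except` block) could end the method like this:

```python
# Inside the retry loop, remember the most recent stale-session error:
#     except ValueError as e:
#       last_exception = e
#       ...
# After the loop, all retries exhausted:
final_error_message = (
    f'Failed to append event after {max_retries} retry attempts.'
)
logger.error(final_error_message)
raise ValueError(final_error_message) from last_exception
```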
```diff
 event.branch = function_call.branch

-await self.session_service.append_event(session=session, event=event)
+await self._append_event_with_retry(session=session, event=event)
```
The _append_event_with_retry method returns a potentially new, refreshed session object on success. By ignoring the return value, any updates from refreshing the session during retries are lost. The session object in the calling scope will not reflect the latest state from storage.
The other calls to _append_event_with_retry in _exec_with_plugin correctly update the local session variable (e.g., line 744: session = await ...). The same should be done here.
This will require changing _append_new_message_to_session to return the updated session, and then propagating this returned session up through _handle_new_message and _setup_context_for_new_invocation to ensure the session variable in _run_with_trace is correctly updated.
Suggested change:

```python
return await self._append_event_with_retry(session=session, event=event)
```
```python
else:
  logger.warning(
      'Could not refresh session, using original session for retry'
  )
```
If refreshing the session fails (refreshed_session is None), the code logs a warning and proceeds to retry with the old, stale session object. This next attempt is highly likely to fail again with another "stale session" error, wasting time and resources.
If the session cannot be refreshed, it might indicate that it has been deleted or is otherwise inaccessible. It would be more robust to fail fast in this scenario by raising an exception.
```python
else:
  # If session can't be refreshed, it might have been deleted.
  # Failing fast is better than retrying with a stale session object.
  no_refresh_message = (
      'Could not refresh session, failing fast. The session may have'
      ' been deleted.'
  )
  logger.error(no_refresh_message)
  raise ValueError(no_refresh_message) from e
```
```python
# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
backoff_time = base_backoff * (attempt + 1)
```
The PR title mentions "exponential backoff", but the implementation and docstring describe a linear backoff. Exponential backoff is generally a more robust strategy for retries, especially under high contention, as it helps to prevent clients from retrying in synchronized waves.
Consider switching to exponential backoff with jitter to improve the retry mechanism. You would also need to update the docstring and the inline comment to reflect this change.
Suggested change:

```diff
-# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
-backoff_time = base_backoff * (attempt + 1)
+# Calculate backoff time with exponential backoff (e.g., 0.5s, 1s, 2s, 4s, 8s)
+backoff_time = base_backoff * (2**attempt)
```
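Since the comment also mentions jitter, here is a minimal, hedged sketch of exponential backoff with full jitter (a generic illustration, not code from this PR or the suggestion above):

```python
import random


def jittered_backoff(attempt: int, base_backoff: float = 0.5) -> float:
  """Exponential backoff with full jitter: random delay in [0, cap].

  The cap doubles each attempt (0.5s, 1s, 2s, 4s, 8s for base 0.5), and the
  random spread keeps concurrent retries from waking up in synchronized waves.
  """
  cap = base_backoff * (2 ** attempt)
  return random.uniform(0, cap)
```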
This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
#1049
2. Or, if no issue exists, describe the change:
If applicable, please follow the issue templates to provide as much detail as
possible.
Problem:
A clear and concise description of what the problem is.
When multiple concurrent requests attempt to append events to the same session, the Runner can encounter stale session errors. These errors occur when the session's last_update_time in storage is newer than the session object being used, indicating the session was modified by another process between when it was fetched and when the event was appended.

Previously, these errors would cause the event append operation to fail immediately, requiring manual retries or causing user-facing errors in concurrent scenarios.
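As a rough illustration of this failure mode, here is a hedged, self-contained sketch of an optimistic-concurrency check of the kind described above; `Session`, `append_event`, and the in-memory `_storage` dict are hypothetical stand-ins, not ADK's actual session service:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Session:
  id: str
  last_update_time: float = field(default_factory=time.time)
  events: list = field(default_factory=list)


_storage: dict[str, Session] = {}


def append_event(session: Session, event: dict) -> None:
  stored = _storage[session.id]
  # Optimistic concurrency: if storage was updated after this session object
  # was fetched, the in-memory copy is stale and the append is rejected.
  if stored.last_update_time > session.last_update_time:
    raise ValueError('stale session: storage has a newer last_update_time')
  stored.events.append(event)
  stored.last_update_time = time.time()
  session.last_update_time = stored.last_update_time
```

In this toy model, two callers that fetch the same session and then append concurrently will see the second append fail, which is the race the retry logic is meant to absorb.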
Solution:
A clear and concise description of what you want to happen and why you choose
this solution.
Testing Plan
Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.
Unit Tests:
Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:
Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.
Checklist
Additional context
- The retry logic only activates for stale session errors (detected by error message content; a sketch of such a check is shown below)
- Other errors raised by append_event are not retried

Add any other context or screenshots about the feature request here.
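A hedged sketch of the kind of message-content check the first bullet describes (the matched strings are assumptions; the actual error text raised by the session service may differ):

```python
def _is_stale_session_error(error: Exception) -> bool:
  # Hypothetical check: treat any exception whose message mentions a stale
  # session or a newer last_update_time as retryable.
  message = str(error).lower()
  return 'stale session' in message or 'last_update_time' in message
```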