fix(runners): Add retry logic with exponential backoff for stale session errors #3586
base: main
Conversation
fix(runners): Add retry logic with backoff for stale session errors

- Add _append_event_with_retry method that retries up to 5 times with linear backoff (0.5s, 1s, 1.5s, 2s, 2.5s)
- Implement 90-second total timeout to prevent indefinite retries
- Refresh session from storage before each retry to ensure latest state
- Update _exec_with_plugin to use retry logic for event appending
- Update _append_new_message_to_session to use retry logic
- Add time import for timeout calculations

This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.
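As a rough illustration of the approach described above, here is a minimal, self-contained sketch (not the PR's actual code): `append_fn` and `refresh_fn` are hypothetical stand-ins for the session service's append/get calls, the stale-session error is assumed to surface as a ValueError whose message mentions "stale session", and this sketch raises on exhaustion rather than returning the original session.

```python
import asyncio
import time


async def append_with_retry(append_fn, refresh_fn, session, event,
                            max_retries=5, base_backoff=0.5,
                            total_timeout=90.0):
  """Retries append_fn on stale-session errors with linear backoff.

  append_fn(session, event) and refresh_fn(session) are hypothetical
  awaitables standing in for the session service's append/get calls.
  """
  start_time = time.monotonic()
  last_error = None
  for attempt in range(max_retries):
    try:
      await append_fn(session, event)
      return session
    except ValueError as e:
      if 'stale session' not in str(e).lower():
        raise  # Only stale-session errors are retried.
      last_error = e
      elapsed = time.monotonic() - start_time
      backoff = base_backoff * (attempt + 1)  # 0.5s, 1s, 1.5s, 2s, 2.5s
      if elapsed + backoff >= total_timeout:
        raise ValueError(
            f'Gave up after {attempt + 1} attempts and {elapsed:.1f}s'
        ) from e
      await asyncio.sleep(backoff)
      # Re-fetch the latest session state before the next attempt.
      refreshed = await refresh_fn(session)
      if refreshed is not None:
        session = refreshed
  raise ValueError(
      f'Failed to append event after {max_retries} attempts'
  ) from last_error
```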
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello @BaudhikMalik, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Runner's robustness by introducing a sophisticated retry mechanism for 'stale session' errors. Previously, concurrent updates could lead to immediate failures. The new logic automatically retries event appending with linear backoff and session refreshing, ensuring that the system can gracefully handle race conditions and temporary inconsistencies in session state, thereby improving overall stability and user experience in concurrent environments.

Highlights
Response from ADK Triaging Agent

Hello @BaudhikMalik, thank you for creating this PR! To help reviewers evaluate your contribution, could you please provide the following information as outlined in our contribution guidelines?

Additionally, it appears the Contributor License Agreement (CLA) is not signed. Please ensure you have signed the CLA so we can proceed with the review. This information will help us to review your PR more efficiently. Thanks!
Code Review
This pull request introduces retry logic with backoff for handling stale session errors, which is a valuable addition for improving the runner's resilience in concurrent environments. The implementation correctly identifies stale session errors and attempts to recover by refreshing the session. However, I've identified several critical issues with the current implementation. The retry logic silently fails on timeout or after exhausting all retries, which can lead to data loss. There's also a bug where an updated session object from a successful retry is ignored in one of the call sites. Additionally, there are opportunities to make the error handling and retry strategy more robust. My review includes suggestions to address these points.
```python
if elapsed_time >= total_timeout:
  logger.warning(
      'Failed to append event after %d attempts and %.1f seconds: %s',
      attempt + 1,
      elapsed_time,
      error_message,
  )
  # Return the original session on timeout
  return session
```
The method returns the original session on timeout, which is a silent failure. This can lead to data loss, as the caller is not notified that the event was not appended. This also contradicts the docstring, which states a ValueError should be raised on failure. The same issue exists when retries are exhausted (lines 379-384) or when the backoff would exceed the timeout (lines 347-352).
An exception should be raised in all failure cases to properly signal that the operation did not complete.
Suggested change:

```diff
 if elapsed_time >= total_timeout:
-  logger.warning(
-      'Failed to append event after %d attempts and %.1f seconds: %s',
-      attempt + 1,
-      elapsed_time,
-      error_message,
-  )
-  # Return the original session on timeout
-  return session
+  timeout_message = (
+      f'Failed to append event after {attempt + 1} attempts and'
+      f' {elapsed_time:.1f} seconds due to total timeout: {error_message}'
+  )
+  logger.warning(timeout_message)
+  raise ValueError(timeout_message) from e
```
```python
# All retries exhausted
logger.warning(
    'Failed to append event after %d retry attempts, returning original'
    ' session',
    max_retries,
)
return session
```
When all retries are exhausted, the method returns the original session, which is a silent failure. This contradicts the docstring and can lead to data loss. An exception should be raised to signal failure to the caller.
To provide better context, you could capture the last exception from the except block in a variable and then raise a new ValueError from it after the loop finishes.
```python
# All retries exhausted
final_error_message = (
    f'Failed to append event after {max_retries} retry attempts.'
)
logger.error(final_error_message)
raise ValueError(final_error_message)
```
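The snippet above raises a new ValueError without chaining; following the comment's idea of capturing the last exception, a hedged fragment (assuming a hypothetical `last_exception` variable assigned in the retry loop's `except` block) could end the method like this:

```python
# Inside the retry loop, remember the most recent stale-session error:
#     except ValueError as e:
#       last_exception = e
#       ...
# After the loop, all retries exhausted:
final_error_message = (
    f'Failed to append event after {max_retries} retry attempts.'
)
logger.error(final_error_message)
raise ValueError(final_error_message) from last_exception
```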
```diff
 event.branch = function_call.branch

-await self.session_service.append_event(session=session, event=event)
+await self._append_event_with_retry(session=session, event=event)
```
The _append_event_with_retry method returns a potentially new, refreshed session object on success. By ignoring the return value, any updates from refreshing the session during retries are lost. The session object in the calling scope will not reflect the latest state from storage.
The other calls to _append_event_with_retry in _exec_with_plugin correctly update the local session variable (e.g., line 744: session = await ...). The same should be done here.
This will require changing _append_new_message_to_session to return the updated session, and then propagating this returned session up through _handle_new_message and _setup_context_for_new_invocation to ensure the session variable in _run_with_trace is correctly updated.
Suggested change:

```python
return await self._append_event_with_retry(session=session, event=event)
```
```python
else:
  logger.warning(
      'Could not refresh session, using original session for retry'
  )
```
If refreshing the session fails (refreshed_session is None), the code logs a warning and proceeds to retry with the old, stale session object. This next attempt is highly likely to fail again with another "stale session" error, wasting time and resources.
If the session cannot be refreshed, it might indicate that it has been deleted or is otherwise inaccessible. It would be more robust to fail fast in this scenario by raising an exception.
```python
else:
  # If session can't be refreshed, it might have been deleted.
  # Failing fast is better than retrying with a stale session object.
  no_refresh_message = (
      'Could not refresh session, failing fast. The session may have'
      ' been deleted.'
  )
  logger.error(no_refresh_message)
  raise ValueError(no_refresh_message) from e
```
```python
# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
backoff_time = base_backoff * (attempt + 1)
```
The PR title mentions "exponential backoff", but the implementation and docstring describe a linear backoff. Exponential backoff is generally a more robust strategy for retries, especially under high contention, as it helps to prevent clients from retrying in synchronized waves.
Consider switching to exponential backoff with jitter to improve the retry mechanism. You would also need to update the docstring and the inline comment to reflect this change.
Suggested change:

```diff
-# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
-backoff_time = base_backoff * (attempt + 1)
+# Calculate backoff time with exponential backoff (e.g., 0.5s, 1s, 2s, 4s, 8s)
+backoff_time = base_backoff * (2**attempt)
```
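Since the comment also mentions jitter, here is a minimal, hedged sketch of exponential backoff with full jitter (a generic illustration, not code from this PR or the suggestion above):

```python
import random


def jittered_backoff(attempt: int, base_backoff: float = 0.5) -> float:
  """Exponential backoff with full jitter: random delay in [0, cap].

  The cap doubles each attempt (0.5s, 1s, 2s, 4s, 8s for base 0.5), and the
  random spread keeps concurrent retries from waking up in synchronized waves.
  """
  cap = base_backoff * (2 ** attempt)
  return random.uniform(0, cap)
```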
This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
#1049
2. Or, if no issue exists, describe the change:
If applicable, please follow the issue templates to provide as much detail as
possible.
Problem:
A clear and concise description of what the problem is.
When multiple concurrent requests attempt to append events to the same session, the Runner can encounter stale session errors. These errors occur when the session's last_update_time in storage is newer than the session object being used, indicating the session was modified by another process between when it was fetched and when the event was appended.

Previously, these errors would cause the event append operation to fail immediately, requiring manual retries or causing user-facing errors in concurrent scenarios.
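As a rough illustration of this failure mode, here is a hedged, self-contained sketch of an optimistic-concurrency check of the kind described above; `Session`, `append_event`, and the in-memory `_storage` dict are hypothetical stand-ins, not ADK's actual session service:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Session:
  id: str
  last_update_time: float = field(default_factory=time.time)
  events: list = field(default_factory=list)


_storage: dict[str, Session] = {}


def append_event(session: Session, event: dict) -> None:
  stored = _storage[session.id]
  # Optimistic concurrency: if storage was updated after this session object
  # was fetched, the in-memory copy is stale and the append is rejected.
  if stored.last_update_time > session.last_update_time:
    raise ValueError('stale session: storage has a newer last_update_time')
  stored.events.append(event)
  stored.last_update_time = time.time()
  session.last_update_time = stored.last_update_time
```

In this toy model, two callers that fetch the same session and then append concurrently will see the second append fail, which is the race the retry logic is meant to absorb.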
Solution:
A clear and concise description of what you want to happen and why you choose
this solution.
Testing Plan
Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.
Unit Tests:
Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:
Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.
Checklist
Additional context
- The retry logic only activates for stale session errors (detected by error message content; a sketch of such a check is shown below)
- Other errors raised by append_event are not retried

Add any other context or screenshots about the feature request here.
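A hedged sketch of the kind of message-content check the first bullet describes (the matched strings are assumptions; the actual error text raised by the session service may differ):

```python
def _is_stale_session_error(error: Exception) -> bool:
  # Hypothetical check: treat any exception whose message mentions a stale
  # session or a newer last_update_time as retryable.
  message = str(error).lower()
  return 'stale session' in message or 'last_update_time' in message
```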