Skip to content

Conversation

@BaudhikMalik
Copy link

  • Add _append_event_with_retry method that retries up to 5 times with linear backoff (0.5s, 1s, 1.5s, 2s, 2.5s)
  • Implement 90-second total timeout to prevent indefinite retries
  • Refresh session from storage before each retry to ensure latest state
  • Update _exec_with_plugin to use retry logic for event appending
  • Update _append_new_message_to_session to use retry logic
  • Add time import for timeout calculations

This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):
#1049

  • Closes: #issue_number
  • Related: #issue_number

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
A clear and concise description of what the problem is.

When multiple concurrent requests attempt to append events to the same session, the Runner can encounter stale session errors. These errors occur when the session's last_update_time in storage is newer than the session object being used, indicating the session was modified by another process between when it was fetched and when the event was appended.

Previously, these errors would cause the event append operation to fail immediately, requiring manual retries or causing user-facing errors in concurrent scenarios.

Solution:
A clear and concise description of what you want to happen and why you choose
this solution.

  1. Detects stale session errors by checking for "stale session" in the error message
  2. Retries up to 5 times with linear backoff (0.5s, 1s, 1.5s, 2s, 2.5s)
  3. Refreshes the session from storage before each retry to ensure the latest state is used
  4. Enforces a 90-second total timeout to prevent indefinite retries
  5. Logs warnings for visibility into retry attempts and failures
  6. Returns the updated session object after successful append

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

The retry logic only activates for stale session errors (detected by error message content)

  • Other validation errors from append_event are not retried
  • Session refresh ensures the Runner always uses the latest session state
  • Concurrent session updates should now be handled gracefully
    Add any other context or screenshots about the feature request here.

…sion errors

- Add _append_event_with_retry method that retries up to 5 times with linear backoff (0.5s, 1s, 1.5s, 2s, 2.5s)
- Implement 90-second total timeout to prevent indefinite retries
- Refresh session from storage before each retry to ensure latest state
- Update _exec_with_plugin to use retry logic for event appending
- Update _append_new_message_to_session to use retry logic
- Add time import for timeout calculations

This ensures the Runner handles concurrent session updates gracefully by automatically retrying stale session errors instead of failing immediately.
@google-cla
Copy link

google-cla bot commented Nov 17, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @BaudhikMalik, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Runner's robustness by introducing a sophisticated retry mechanism for 'stale session' errors. Previously, concurrent updates could lead to immediate failures. The new logic automatically retries event appending with linear backoff and session refreshing, ensuring that the system can gracefully handle race conditions and temporary inconsistencies in session state, thereby improving overall stability and user experience in concurrent environments.

Highlights

  • New Retry Logic: Introduced a new asynchronous method, _append_event_with_retry, to handle ValueError exceptions specifically for 'stale session' errors.
  • Linear Backoff and Retries: The retry mechanism attempts up to 5 times with a linear backoff strategy (0.5s, 1s, 1.5s, 2s, 2.5s) to space out retry attempts.
  • Total Timeout: A 90-second total timeout has been implemented to prevent indefinite retries, ensuring operations do not hang indefinitely.
  • Session Refresh: Before each retry attempt, the session is refreshed from storage to ensure the latest state is used, mitigating issues caused by concurrent updates.
  • Integration into Event Appending: The new retry logic has been integrated into _exec_with_plugin and _append_new_message_to_session to automatically handle stale session errors during event appending.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@adk-bot adk-bot added the core [Component] This issue is related to the core interface and implementation label Nov 17, 2025
@adk-bot
Copy link
Collaborator

adk-bot commented Nov 17, 2025

Response from ADK Triaging Agent

Hello @BaudhikMalik, thank you for creating this PR!

To help reviewers evaluate your contribution, could you please provide the following information as outlined in our contribution guidelines?

  • A summary of the pytest results for the unit tests you've added or updated.
  • Details on the manual end-to-end (E2E) tests you performed, including instructions, logs, or screenshots.

Additionally, it appears the Contributor License Agreement (CLA) is not signed. Please ensure you have signed the CLA so we can proceed with the review.

This information will help us to review your PR more efficiently. Thanks!

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces retry logic with backoff for handling stale session errors, which is a valuable addition for improving the runner's resilience in concurrent environments. The implementation correctly identifies stale session errors and attempts to recover by refreshing the session. However, I've identified several critical issues with the current implementation. The retry logic silently fails on timeout or after exhausting all retries, which can lead to data loss. There's also a bug where an updated session object from a successful retry is ignored in one of the call sites. Additionally, there are opportunities to make the error handling and retry strategy more robust. My review includes suggestions to address these points.

Comment on lines +329 to +337
if elapsed_time >= total_timeout:
logger.warning(
'Failed to append event after %d attempts and %.1f seconds: %s',
attempt + 1,
elapsed_time,
error_message,
)
# Return the original session on timeout
return session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The method returns the original session on timeout, which is a silent failure. This can lead to data loss, as the caller is not notified that the event was not appended. This also contradicts the docstring, which states a ValueError should be raised on failure. The same issue exists when retries are exhausted (lines 379-384) or when the backoff would exceed the timeout (lines 347-352).

An exception should be raised in all failure cases to properly signal that the operation did not complete.

Suggested change
if elapsed_time >= total_timeout:
logger.warning(
'Failed to append event after %d attempts and %.1f seconds: %s',
attempt + 1,
elapsed_time,
error_message,
)
# Return the original session on timeout
return session
if elapsed_time >= total_timeout:
timeout_message = (
f'Failed to append event after {attempt + 1} attempts and'
f' {elapsed_time:.1f} seconds due to total timeout: {error_message}'
)
logger.warning(timeout_message)
raise ValueError(timeout_message) from e

Comment on lines +378 to +384
# All retries exhausted
logger.warning(
'Failed to append event after %d retry attempts, returning original'
' session',
max_retries,
)
return session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

When all retries are exhausted, the method returns the original session, which is a silent failure. This contradicts the docstring and can lead to data loss. An exception should be raised to signal failure to the caller.

To provide better context, you could capture the last exception from the except block in a variable and then raise a new ValueError from it after the loop finishes.

    # All retries exhausted
    final_error_message = (
        f'Failed to append event after {max_retries} retry attempts.'
    )
    logger.error(final_error_message)
    raise ValueError(final_error_message)

event.branch = function_call.branch

await self.session_service.append_event(session=session, event=event)
await self._append_event_with_retry(session=session, event=event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The _append_event_with_retry method returns a potentially new, refreshed session object on success. By ignoring the return value, any updates from refreshing the session during retries are lost. The session object in the calling scope will not reflect the latest state from storage.

The other calls to _append_event_with_retry in _exec_with_plugin correctly update the local session variable (e.g., line 744: session = await ...). The same should be done here.

This will require changing _append_new_message_to_session to return the updated session, and then propagating this returned session up through _handle_new_message and _setup_context_for_new_invocation to ensure the session variable in _run_with_trace is correctly updated.

    return await self._append_event_with_retry(session=session, event=event)

Comment on lines +373 to +376
else:
logger.warning(
'Could not refresh session, using original session for retry'
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If refreshing the session fails (refreshed_session is None), the code logs a warning and proceeds to retry with the old, stale session object. This next attempt is highly likely to fail again with another "stale session" error, wasting time and resources.

If the session cannot be refreshed, it might indicate that it has been deleted or is otherwise inaccessible. It would be more robust to fail fast in this scenario by raising an exception.

        else:
          # If session can't be refreshed, it might have been deleted.
          # Failing fast is better than retrying with a stale session object.
          no_refresh_message = (
              'Could not refresh session, failing fast. The session may have'
              ' been deleted.'
          )
          logger.error(no_refresh_message)
          raise ValueError(no_refresh_message) from e

Comment on lines +339 to +340
# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
backoff_time = base_backoff * (attempt + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The PR title mentions "exponential backoff", but the implementation and docstring describe a linear backoff. Exponential backoff is generally a more robust strategy for retries, especially under high contention, as it helps to prevent clients from retrying in synchronized waves.

Consider switching to exponential backoff with jitter to improve the retry mechanism. You would also need to update the docstring and the inline comment to reflect this change.

Suggested change
# Calculate backoff time (0.5s, 1s, 1.5s, 2s, 2.5s)
backoff_time = base_backoff * (attempt + 1)
# Calculate backoff time with exponential backoff (e.g., 0.5s, 1s, 2s, 4s, 8s)
backoff_time = base_backoff * (2**attempt)

@ryanaiagent ryanaiagent self-assigned this Nov 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core [Component] This issue is related to the core interface and implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants