Improve OpenAI error handling #918
Merged
Description
This PR adds proper exception handling to the OpenAI model provider so that context window overflow and rate limiting errors are correctly distinguished. Previously, the OpenAI model provider did not handle these errors, causing them to bubble up as raw OpenAI SDK exceptions instead of being converted to the SDK's standardized exception types.
Key Changes
- Added error handling to the `stream()` and `structured_output()` methods (a sketch of the mapping is shown below)
- Context length errors (`context_length_exceeded`) → `ContextWindowOverflowException`
- Rate limit errors (`rate_limit_exceeded`) → `ModelThrottledException`
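A minimal sketch of what this mapping could look like, assuming the openai-python v1 SDK; the wrapper function and the exception import path below are illustrative, not the PR's actual diff:

```python
# Illustrative only: the PR touches the provider's stream()/structured_output(),
# but this standalone wrapper shows the same error mapping. The import path for
# the SDK exceptions is an assumption -- adjust it to the project layout.
import openai

from strands.types.exceptions import (
    ContextWindowOverflowException,
    ModelThrottledException,
)


async def create_with_error_mapping(client: openai.AsyncOpenAI, **request):
    try:
        return await client.chat.completions.create(**request)
    except openai.BadRequestError as e:
        # 400 with code "context_length_exceeded" -> context window overflow
        if getattr(e, "code", None) == "context_length_exceeded":
            raise ContextWindowOverflowException(str(e)) from e
        raise
    except openai.RateLimitError as e:
        # 429 (TPM or RPM) -> throttling, regardless of the specific message
        raise ModelThrottledException(str(e)) from e
```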
OpenAI Error Types
OpenAI has two separate token limits that cause different error types requiring different handling:
1. Context Length Limit: 400 `BadRequestError` with error code `context_length_exceeded`
2. Rate Limit - TPM: 429 `RateLimitError` with error code `rate_limit_exceeded`
Why This Matters
Large context requests can be rejected at the rate limiting layer before reaching context length validation. The TPM limit acts as a "first line of defense" - if a single request would consume more tokens than the per-minute allowance, it gets blocked immediately as a rate limit violation rather than proceeding to context validation.
This means even with models supporting large context windows (like GPT-4o's 128k), users are effectively limited by their TPM quota for single large requests.
Implementation Decision: All Rate Limits as Throttling
Important: We cannot reliably differentiate between token rate limiting triggered by a single oversized request and rate limiting triggered by cumulative usage across recent requests. Depending on specific error messages to tell these apart is fragile, because the wording is not a stable API contract and varies across OpenAI-compatible providers.
Therefore, all rate limit errors are treated as throttling and handled with retry logic, regardless of whether they're token-based or request-based. This provides consistent, robust behavior across all OpenAI-compatible providers.
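For callers, this means throttling is recoverable by waiting and retrying. A minimal retry-with-backoff sketch, assuming the caller handles `ModelThrottledException` itself (the agent loop may already do this internally):

```python
# Hypothetical retry helper, not part of the PR: exponential backoff with jitter
# for throttling errors surfaced as ModelThrottledException.
import asyncio
import random

from strands.types.exceptions import ModelThrottledException


async def call_with_retries(fn, *args, max_attempts: int = 5, base_delay: float = 1.0, **kwargs):
    for attempt in range(max_attempts):
        try:
            return await fn(*args, **kwargs)
        except ModelThrottledException:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter, so retries spread out across the minute window
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```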
Related Issues
Partially addresses #860
Note: This PR fixes the missing exception handling in the OpenAI model provider, but does not solve the underlying issue reported in #860. The user is experiencing TPM rate limit errors and expecting the `SummarizingConversationManager` to handle them by reducing context. However, rate limit errors should be handled with retry logic, not context reduction, as reducing context size won't help with TPM limits, which are time-based quotas.

Type of Change
Bug fix
Testing
How have you tested the change?
- Ran `hatch run prepare`
- Verified that context length errors are converted to `ContextWindowOverflowException`
- Verified that rate limit errors are converted to `ModelThrottledException`
- Covered both the `stream()` and `structured_output()` methods

Test Coverage
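An illustrative sketch of this kind of unit test, exercising a stand-in mapping helper rather than the provider's real methods (the helper and import path below are assumptions for illustration, not the PR's actual tests):

```python
import httpx
import openai

from strands.types.exceptions import (
    ContextWindowOverflowException,
    ModelThrottledException,
)


def map_openai_error(error: openai.APIError) -> Exception:
    """Stand-in for the provider's error mapping (hypothetical helper)."""
    if isinstance(error, openai.BadRequestError) and getattr(error, "code", None) == "context_length_exceeded":
        return ContextWindowOverflowException(str(error))
    if isinstance(error, openai.RateLimitError):
        return ModelThrottledException(str(error))
    return error


def _api_error(cls, status: int, body: dict | None) -> openai.APIStatusError:
    # openai-python v1 status errors are built from an httpx.Response plus the error body
    request = httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
    return cls("boom", response=httpx.Response(status, request=request), body=body)


def test_context_length_exceeded_maps_to_overflow():
    error = _api_error(openai.BadRequestError, 400, {"code": "context_length_exceeded"})
    assert isinstance(map_openai_error(error), ContextWindowOverflowException)


def test_rate_limit_maps_to_throttled():
    error = _api_error(openai.RateLimitError, 429, None)
    assert isinstance(map_openai_error(error), ModelThrottledException)
```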
Behavior Changes
Before: context window overflow and rate limit errors bubbled up as raw OpenAI SDK exceptions (`openai.BadRequestError`, `openai.RateLimitError`).
After: these errors are converted to the SDK's standardized `ContextWindowOverflowException` and `ModelThrottledException` types.
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.