Prompt Caching #234
Conversation
|
@crmne As I don't have an Anthropic key, I'll need you to generate the VCR cassettes for that provider. Hoping everything just works, but let me know if not. |
|
@tpaulshippy this would be great to have! Would you be willing to enable it on all providers? I'll do a proper review when I can. |
|
My five minutes of research indicates that at least OpenAI and Gemini automatically cache for you based on the size and structure of your request. So the only support I think we'd really need for those two is to populate the cached token counts on the response messages. We could also try to support explicit caching on the Gemini API, but that looks complex and less commonly needed. Do you know of other providers that require payload changes for prompt caching? |
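For OpenAI and Gemini, that reporting might look something like the sketch below; `cached_tokens` is an assumed accessor name for illustration, not an existing RubyLLM property.

```ruby
# Sketch: surface provider-reported cache hits on the response message.
# input_tokens/output_tokens already exist on RubyLLM::Message;
# cached_tokens is a hypothetical addition shown for illustration only.
chat = RubyLLM.chat(model: "gpt-4o-mini")
response = chat.ask("Answer using the long, frequently reused system prompt.")

puts response.input_tokens   # prompt tokens billed for this request
puts response.output_tokens  # completion tokens
puts response.cached_tokens  # portion of the prompt served from the provider cache (assumed name)
```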
```ruby
def with_cache_control(hash, cache: false)
  return hash unless cache

  hash.merge(cache_control: { type: 'ephemeral' })
end
```
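For context, a hedged sketch of how a helper like this gets applied when building the provider payload; the variable names below are illustrative, not the PR's actual code.

```ruby
# Illustrative only: variable names are made up for the example.
system_prompt = "Very long system prompt reused across requests."
payload = {}
cache_system = true

system_block = { type: 'text', text: system_prompt }
payload[:system] = [with_cache_control(system_block, cache: cache_system)]
# => the ephemeral cache_control marker is merged in only when cache_system is true
```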
Realizing this might cause errors on older models that do not support caching. If it does, we could raise here, or just let the API validation handle it. I'm torn on whether the capabilities check complexity is worth it as these models are probably so rarely used.
Scratch that. I decided to stop being a cheapskate and just pay Anthropic their $5. |
|
Looking to implement this in our project, and now I'm wondering whether it should be opt-out rather than opt-in. If you're using unique prompts every time, I guess caching them adds some cost, but my guess is that in most applications prompts will get repeated, especially system prompts. |
crmne left a comment
Thank you for this feature @tpaulshippy; however, there are several improvements I'd like you to make before we merge this.
On top of the ones noted in the comments, the most important one: I'd like prompt caching implemented in all providers.
Also, I haven't fully checked the logic in providers/anthropic, but at first glance the patch seems a bit heavy-handed in the amount of change required. Were all those changes necessary, or could it be done in a simpler manner?
Did you see this? Is the request to populate the cached token counts on the response messages for OpenAI and Gemini? |
Thank you for pointing that out; I had missed it. It would certainly be a nice addition to RubyLLM for all providers to offer roughly the same level of caching support. |
OK, we have a bit of a naming issue: each provider reports its cache token counts under different property names (Anthropic, OpenAI, and Gemini all differ). My reading of the docs indicates that the OpenAI and Gemini values correspond pretty closely with Anthropic's cache-read count. What should we call these properties in the Message? |
|
For the naming, let's go with the names that keep it consistent with our existing token count properties. Can you update the Message properties to use these names? Thanks Paul! |
|
One-shot prompt scenarios are our main use case, so the above would work great. Caching support is also a blocker on our making the jump to RubyLLM. Thanks, all! |
No need to wait, use:

```ruby
RubyLLM
  .chat(model: "claude-sonnet-4-20250514")
  .with_params(system: [{
    type: "text",
    text: "This is my very long system prompt that will get cached.",
    cache_control: { type: "ephemeral" },
  }])
```
Hm, when trying that, @maximevaillancourt, and later doing a .ask, my system prompt doesn't end up getting through to OpenRouter. Are you using this approach successfully? |
Yes, but worth noting that I'm using |
|
Hi @tpaulshippy @crmne -- anything we can do to help this along? Happy to help out if needed. |
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
+ Coverage   89.72%   89.78%   +0.06%
==========================================
  Files          36       36
  Lines        1761     1772      +11
  Branches      481      487       +6
==========================================
+ Hits         1580     1591      +11
  Misses        181      181
```
|
|
Thanks for picking this back up! Have you played around much with the 1h TTL, @tpaulshippy? https://docs.claude.com/en/docs/build-with-claude/prompt-caching#1-hour-cache-duration |
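Per that doc, the 1-hour cache is requested by adding a ttl field to the cache_control block; a raw-payload sketch follows (shape taken from Anthropic's docs, worth double-checking before relying on it, and priced differently from the default 5-minute cache).

```ruby
# Raw Anthropic payload shape for the 1-hour cache (per the linked docs);
# omitting ttl gives the default 5-minute ephemeral cache.
system_block = {
  type: "text",
  text: "Very long system prompt reused across many requests.",
  cache_control: { type: "ephemeral", ttl: "1h" }
}
```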
No, I haven't. It didn't seem that useful for our scenarios. It would be a good addition to this library, though. |
|
@tpaulshippy, I really appreciate the work you poured into this. However, I've had a nagging gut feeling the whole time: the amount of churn here never felt proportionate to the feature. This ended up rewriting a good chunk of the library for what's ultimately an Anthropic quirk.

In the end it was impossible to review this in a way that steered toward what I had in mind without actually building it myself. Once I went hands-on, I revisited my own earlier suggestion, and the exploration led to Raw Content Blocks (869a755): raw messages that go straight to the LLM. This way Anthropic gets its caching hooks, we can support any weird provider-specific quirk of the message contents, and we keep the core clean and provider-agnostic.

I've shipped the docs (https://rubyllm.com/chat/#raw-content-blocks), updated the Rails integration, and added an update generator for 1.9. Thanks again for your work, and enjoy Raw Content Blocks! |
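A sketch of how that might look for Anthropic-style caching, using the RubyLLM::Providers::Anthropic::Content raw block mentioned further down; the constructor arguments and call site below are assumptions, so check the docs above for the real interface.

```ruby
# Sketch only: both the constructor arguments and passing the block to #ask are
# assumptions; the point is that provider-specific fields such as cache_control
# pass straight through to the API untouched.
raw_block = RubyLLM::Providers::Anthropic::Content.new(
  [
    {
      type: "text",
      text: "This is my very long system prompt that will get cached.",
      cache_control: { type: "ephemeral" }
    }
  ]
)

chat = RubyLLM.chat(model: "claude-sonnet-4-20250514")
chat.ask(raw_block)
```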
|
This new change is working almost perfectly for us, @crmne! One callout: we're using the OpenRouter provider (but primarily Anthropic models within that), and while message caching with raw blocks (using RubyLLM::Providers::Anthropic::Content in an OpenRouter chat) is working great, the tool with_params pattern (in the tool subclasses) is not caching tool definitions. I think it's because of the subclass hierarchy. Also, Anthropic is a bit unusual in that their docs have you add the cache marker to the last tool: https://docs.claude.com/en/docs/build-with-claude/prompt-caching#prompt-caching-examples (Caching tool definitions). |
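For reference, this is the shape Anthropic's docs describe for caching tool definitions: only the final tool in the list carries the cache_control marker, and everything before it is included in the cached prefix. The tool definitions below are illustrative, and this is the raw API payload, not a RubyLLM helper.

```ruby
# Raw Anthropic-style tools array: marking the last tool caches the whole
# tool-definition prefix up to and including that tool.
tools = [
  {
    name: "get_weather",
    description: "Look up current weather for a city",
    input_schema: { type: "object", properties: { city: { type: "string" } } }
  },
  {
    name: "get_time",
    description: "Look up the current time for a timezone",
    input_schema: { type: "object", properties: { tz: { type: "string" } } },
    cache_control: { type: "ephemeral" } # only the final tool gets the marker
  }
]
```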
|
As mentioned in the docs, Tool's |
I really liked this idea. Any chance we could get it? Cache the last system message, last user message, and last tool by default? |
This would mean changing the whole thing again and re-adding a lot of your code, only for a bit of magic around a provider quirk. Hard pass. This belongs in your app. Also, that comment precedes the whole investigation I did. |
|
OK, fair enough. I bring it up because one of the strengths of this library is the ability to switch between providers and models seamlessly. Since OpenAI and Gemini cache by default, setting up Anthropic to do the same would be nice. |
|
I think the difference is that Gemini and OpenAI don't charge the user extra for the cache writes, while Anthropic does. |
|
That is true. But in most use cases you end up paying even more if you don't cache at all. Hence this PR. |
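To make that concrete, a rough back-of-the-envelope comparison using Anthropic's published multipliers (cache writes bill at about 1.25x the base input rate, cache reads at about 0.1x, for the default 5-minute cache; exact rates can change):

```ruby
# Illustrative arithmetic in "base input token" units for a 2,000-token system
# prompt reused across 10 requests (multipliers assumed from Anthropic's docs).
base_tokens   = 2_000
without_cache = 10 * base_tokens                           # 20_000
with_cache    = 1.25 * base_tokens + 9 * 0.1 * base_tokens # 4_300

ratio = (with_cache / without_cache.to_f).round(3)
puts ratio # => 0.215, i.e. roughly 78% cheaper than not caching
```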
|
Even if it were opt-in, a way to properly turn on caching for Anthropic in one line, without having to track which tool will be your last, seems like it would be nice.
What this does
Automatically opts into prompt caching in the Anthropic and Bedrock providers for Claude models that support it, and reports prompt-caching token counts for OpenAI and Gemini, which cache automatically.
Usage covers the following cases (illustrative sketch after this list):
- Disabling prompt caching
- Caching just system prompts
- Caching just user prompts
- Caching just tool definitions
- Caching system prompts and tool definitions
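A single sketch covering those cases; the `cache_prompts` method name and its keyword arguments are hypothetical, shown only to illustrate the intended granularity.

```ruby
# Hypothetical API sketch (method name and keywords assumed, not the merged interface).
chat = RubyLLM.chat(model: "claude-sonnet-4-20250514")

chat.cache_prompts(system: false, user: false, tools: false) # disable prompt caching
chat.cache_prompts(system: true)                             # cache just system prompts
chat.cache_prompts(user: true)                               # cache just user prompts
chat.cache_prompts(tools: true)                              # cache just tool definitions
chat.cache_prompts(system: true, tools: true)                # system prompts + tool definitions
```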
Type of change
Scope check
Quality check
- `overcommit --install` run and all hooks pass
- Auto-generated files left untouched (`models.json`, `aliases.json`)

API changes
Related issues
Resolves #13