diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml index 076be6524..e4704a058 100644 --- a/docs/_data/navigation.yml +++ b/docs/_data/navigation.yml @@ -13,6 +13,8 @@ url: /guides/tools - title: Streaming url: /guides/streaming + - title: Prompt Caching + url: /guides/prompt-caching - title: Rails Integration url: /guides/rails - title: Image Generation diff --git a/docs/guides/index.md b/docs/guides/index.md index 8988a8089..aa1ef2489 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -13,30 +13,43 @@ This section contains detailed guides to help you make the most of RubyLLM. Each ## Available Guides ### [Getting Started]({% link guides/getting-started.md %}) + Learn the basics of RubyLLM and get up and running quickly with simple examples. ### [Chat]({% link guides/chat.md %}) + Explore the chat interface, which is the primary way to interact with AI models through RubyLLM. ### [Tools]({% link guides/tools.md %}) + Learn how to extend AI capabilities by creating tools that let models call your Ruby code. ### [Streaming]({% link guides/streaming.md %}) + Understand how to use streaming responses for real-time interactions. +### [Prompt Caching]({% link guides/prompt-caching.md %}) + +Learn how to use Anthropic's prompt caching feature to reduce token usage and costs. + ### [Rails Integration]({% link guides/rails.md %}) + See how to integrate RubyLLM with Rails applications, including ActiveRecord persistence. ### [Image Generation]({% link guides/image-generation.md %}) + Learn how to generate images using DALL-E and other providers. ### [Embeddings]({% link guides/embeddings.md %}) + Explore how to create vector embeddings for semantic search and other applications. ### [Error Handling]({% link guides/error-handling.md %}) + Master the techniques for robust error handling in AI applications. ### [Working with Models]({% link guides/models.md %}) + Learn how to discover, select, and work with different AI models across providers. ## Getting Help @@ -45,4 +58,4 @@ If you can't find what you're looking for in these guides, consider: 1. Checking the [API Documentation]() for detailed information about specific classes and methods 2. Looking at the [GitHub repository](https://github.com/crmne/ruby_llm) for examples and the latest updates -3. Filing an issue on GitHub if you find a bug or have a feature request \ No newline at end of file +3. Filing an issue on GitHub if you find a bug or have a feature request diff --git a/docs/guides/prompt-caching.md b/docs/guides/prompt-caching.md new file mode 100644 index 000000000..f352a6a3b --- /dev/null +++ b/docs/guides/prompt-caching.md @@ -0,0 +1,137 @@ +# Prompt Caching + +RubyLLM supports Anthropic's prompt caching feature, which allows you to cache parts of your prompts to reduce token usage and costs when making similar requests. + +## What is Prompt Caching? + +Prompt caching is a feature that allows you to mark specific parts of your prompt as cacheable. When you make a request with a cached prompt, Anthropic will: + +1. Check if the prompt prefix (up to the cache breakpoint) is already cached +2. If found, use the cached version, reducing processing time and costs +3. 
Otherwise, process the full prompt and cache the prefix + +This is especially useful for: + +- Prompts with many examples +- Large amounts of context or background information +- Repetitive tasks with consistent instructions +- Long multi-turn conversations + +## Supported Models + +Prompt caching is currently supported on the following Anthropic Claude models: + +- Claude 3.7 Sonnet +- Claude 3.5 Sonnet +- Claude 3.5 Haiku +- Claude 3 Haiku +- Claude 3 Opus + +## How to Use Prompt Caching + +To use prompt caching in RubyLLM, you can mark content as cacheable using the `cache_control` parameter: + +```ruby +# Create a chat with a Claude model +chat = RubyLLM.chat(model: 'claude-3-5-sonnet') + +# Add a system message with cache control +chat.with_instructions("You are an AI assistant tasked with analyzing literary works.", + cache_control: true) + +# Add a large document with cache control +chat.ask("Here's the entire text of Pride and Prejudice: [long text...]", + with: { cache_control: true }) + +# Now you can ask questions about the document without reprocessing it +chat.ask("Analyze the major themes in Pride and Prejudice.") +``` + +## Pricing + +Prompt caching introduces a different pricing structure: + +| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens | +| ----------------- | ----------------- | ------------- | ------------ | ------------- | +| Claude 3.7 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok | +| Claude 3.5 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok | +| Claude 3.5 Haiku | $0.80 / MTok | $1 / MTok | $0.08 / MTok | $4 / MTok | +| Claude 3 Haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok | +| Claude 3 Opus | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok | + +Note: + +- Cache write tokens are 25% more expensive than base input tokens +- Cache read tokens are 90% cheaper than base input tokens +- Regular input and output tokens are priced at standard rates + +## Tracking Cache Performance + +When using prompt caching, you can track the cache performance using the following fields in the response: + +```ruby +response = chat.ask("What are the main characters in Pride and Prejudice?") + +puts "Cache creation tokens: #{response.cache_creation_input_tokens}" +puts "Cache read tokens: #{response.cache_read_input_tokens}" +puts "Regular input tokens: #{response.input_tokens}" +puts "Output tokens: #{response.output_tokens}" +``` + +## Cache Limitations + +- The minimum cacheable prompt length is: + - 1024 tokens for Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus + - 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku +- Shorter prompts cannot be cached, even if marked with `cache_control` +- The cache has a minimum 5-minute lifetime +- Cache hits require 100% identical prompt segments + +## Best Practices + +- Place static content (system instructions, context, examples) at the beginning of your prompt +- Mark the end of the reusable content for caching using the `cache_control` parameter +- Use cache breakpoints strategically to separate different cacheable prefix sections +- Regularly analyze cache hit rates and adjust your strategy as needed + +## Example: Document Analysis + +```ruby +# Create a chat with Claude +chat = RubyLLM.chat(model: 'claude-3-5-sonnet') + +# Add system instructions with cache control +chat.with_instructions("You are an AI assistant tasked with analyzing documents.", + cache_control: true) + +# Add a PDF document with cache control +chat.ask("Please analyze this 
document:", + with: { pdf: "large_document.pdf", cache_control: true }) + +# First query - will create a cache +response1 = chat.ask("What are the main points in the executive summary?") +puts "Cache creation tokens: #{response1.cache_creation_input_tokens}" + +# Second query - will use the cache +response2 = chat.ask("Who are the key stakeholders mentioned?") +puts "Cache read tokens: #{response2.cache_read_input_tokens}" +``` + +## Example: Multi-turn Conversation + +```ruby +# Create a chat with Claude +chat = RubyLLM.chat(model: 'claude-3-5-sonnet') + +# Add system instructions with cache control +chat.with_instructions("You are a helpful coding assistant. Use these coding conventions: [long list of conventions]", + cache_control: true) + +# First query - will create a cache +response1 = chat.ask("How do I write a Ruby class for a bank account?") +puts "Cache creation tokens: #{response1.cache_creation_input_tokens}" + +# Second query - will use the cache +response2 = chat.ask("Can you show me how to add a transfer method to that class?") +puts "Cache read tokens: #{response2.cache_read_input_tokens}" +``` diff --git a/lib/ruby_llm/chat.rb b/lib/ruby_llm/chat.rb index a854bb820..db2ca98c2 100644 --- a/lib/ruby_llm/chat.rb +++ b/lib/ruby_llm/chat.rb @@ -26,16 +26,22 @@ def initialize(model: nil, provider: nil) end def ask(message = nil, with: {}, &block) - add_message role: :user, content: Content.new(message, with) + # Extract cache_control from the with hash if present + cache_control = with.delete(:cache_control) + + # Create a new Content object with the message and attachments + content = Content.new(message, with.merge(cache_control: cache_control)) + + add_message role: :user, content: content complete(&block) end alias say ask - def with_instructions(instructions, replace: false) + def with_instructions(instructions, replace: false, cache_control: nil) @messages = @messages.reject! { |msg| msg.role == :system } if replace - add_message role: :system, content: instructions + add_message role: :system, content: Content.new(instructions, cache_control: cache_control) self end diff --git a/lib/ruby_llm/content.rb b/lib/ruby_llm/content.rb index ac326ffd9..1c182615e 100644 --- a/lib/ruby_llm/content.rb +++ b/lib/ruby_llm/content.rb @@ -5,9 +5,12 @@ module RubyLLM # Stores data in a standard internal format, letting providers # handle their own formatting needs. class Content + attr_reader :cache_control + def initialize(text = nil, attachments = {}) # rubocop:disable Metrics/AbcSize,Metrics/MethodLength @parts = [] @parts << { type: 'text', text: text } unless text.nil? || text.empty? + @cache_control = attachments[:cache_control] Array(attachments[:image]).each do |source| @parts << attach_image(source) @@ -29,7 +32,7 @@ def to_a end def format - return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text' + return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text' && @cache_control.nil? 
to_a end diff --git a/lib/ruby_llm/message.rb b/lib/ruby_llm/message.rb index cf6ea7f3c..3f6854c76 100644 --- a/lib/ruby_llm/message.rb +++ b/lib/ruby_llm/message.rb @@ -7,7 +7,8 @@ module RubyLLM class Message ROLES = %i[system user assistant tool].freeze - attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id + attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id, + :cache_creation_input_tokens, :cache_read_input_tokens def initialize(options = {}) @role = options[:role].to_sym @@ -17,6 +18,8 @@ def initialize(options = {}) @output_tokens = options[:output_tokens] @model_id = options[:model_id] @tool_call_id = options[:tool_call_id] + @cache_creation_input_tokens = options[:cache_creation_input_tokens] + @cache_read_input_tokens = options[:cache_read_input_tokens] ensure_valid_role end @@ -41,7 +44,9 @@ def to_h tool_call_id: tool_call_id, input_tokens: input_tokens, output_tokens: output_tokens, - model_id: model_id + model_id: model_id, + cache_creation_input_tokens: cache_creation_input_tokens, + cache_read_input_tokens: cache_read_input_tokens }.compact end diff --git a/lib/ruby_llm/providers/anthropic/capabilities.rb b/lib/ruby_llm/providers/anthropic/capabilities.rb index 4e07afec1..8773cf11e 100644 --- a/lib/ruby_llm/providers/anthropic/capabilities.rb +++ b/lib/ruby_llm/providers/anthropic/capabilities.rb @@ -67,6 +67,29 @@ def supports_json_mode?(model_id) def supports_extended_thinking?(model_id) model_id.match?(/claude-3-7-sonnet/) end + + # Determines if a model supports prompt caching + # @param model_id [String] the model identifier + # @return [Boolean] true if the model supports prompt caching + def supports_caching?(model_id) + model_id.match?(/claude-3(?:-[357])?(?:-(?:opus|sonnet|haiku))/) + end + + # Gets the cache write price per million tokens for a given model + # @param model_id [String] the model identifier + # @return [Float] the price per million tokens for cache writes + def cache_write_price_for(model_id) + # Cache write tokens are 25% more expensive than base input tokens + get_input_price(model_id) * 1.25 + end + + # Gets the cache hit price per million tokens for a given model + # @param model_id [String] the model identifier + # @return [Float] the price per million tokens for cache hits + def cache_hit_price_for(model_id) + # Cache read tokens are 90% cheaper than base input tokens + get_input_price(model_id) * 0.1 + end # Determines the model family for a given model ID # @param model_id [String] the model identifier diff --git a/lib/ruby_llm/providers/anthropic/chat.rb b/lib/ruby_llm/providers/anthropic/chat.rb index 117db1c59..f219fadf8 100644 --- a/lib/ruby_llm/providers/anthropic/chat.rb +++ b/lib/ruby_llm/providers/anthropic/chat.rb @@ -72,6 +72,8 @@ def build_message(data, content, tool_use) tool_calls: parse_tool_calls(tool_use), input_tokens: data.dig('usage', 'input_tokens'), output_tokens: data.dig('usage', 'output_tokens'), + cache_creation_input_tokens: data.dig('usage', 'cache_creation_input_tokens'), + cache_read_input_tokens: data.dig('usage', 'cache_read_input_tokens'), model_id: data['model'] ) end @@ -89,7 +91,7 @@ def format_message(msg) def format_basic_message(msg) { role: convert_role(msg.role), - content: Media.format_content(msg.content) + content: Media.format_content(msg.content, msg.content.is_a?(Content) ? 
msg.content.cache_control : nil) } end diff --git a/lib/ruby_llm/providers/anthropic/media.rb b/lib/ruby_llm/providers/anthropic/media.rb index 7bcfbce02..ffdbbd35a 100644 --- a/lib/ruby_llm/providers/anthropic/media.rb +++ b/lib/ruby_llm/providers/anthropic/media.rb @@ -7,7 +7,7 @@ module Anthropic module Media module_function - def format_content(content) # rubocop:disable Metrics/MethodLength + def format_content(content, cache_control = nil) # rubocop:disable Metrics/MethodLength return content unless content.is_a?(Array) content.map do |part| @@ -17,7 +17,7 @@ def format_content(content) # rubocop:disable Metrics/MethodLength when 'pdf' format_pdf(part) when 'text' - format_text_block(part[:text]) + format_text_block(part[:text], cache_control) else part end @@ -57,11 +57,14 @@ def format_pdf(part) # rubocop:disable Metrics/MethodLength end end - def format_text_block(text) - { + def format_text_block(text, cache_control = nil) + block = { type: 'text', text: text } + + block[:cache_control] = { type: 'ephemeral' } if cache_control + block end end end diff --git a/lib/ruby_llm/providers/anthropic/streaming.rb b/lib/ruby_llm/providers/anthropic/streaming.rb index 3bf842150..5f4df59e6 100644 --- a/lib/ruby_llm/providers/anthropic/streaming.rb +++ b/lib/ruby_llm/providers/anthropic/streaming.rb @@ -18,9 +18,31 @@ def build_chunk(data) content: data.dig('delta', 'text'), input_tokens: extract_input_tokens(data), output_tokens: extract_output_tokens(data), + cache_creation_input_tokens: extract_cache_creation_tokens(data), + cache_read_input_tokens: extract_cache_read_tokens(data), tool_calls: extract_tool_calls(data) ) end + + def extract_model_id(data) + data['model'] + end + + def extract_input_tokens(data) + data.dig('usage', 'input_tokens') + end + + def extract_output_tokens(data) + data.dig('usage', 'output_tokens') + end + + def extract_cache_creation_tokens(data) + data.dig('usage', 'cache_creation_input_tokens') + end + + def extract_cache_read_tokens(data) + data.dig('usage', 'cache_read_input_tokens') + end def json_delta?(data) data['type'] == 'content_block_delta' && data.dig('delta', 'type') == 'input_json_delta' diff --git a/lib/ruby_llm/stream_accumulator.rb b/lib/ruby_llm/stream_accumulator.rb index 0196c6344..e22761414 100644 --- a/lib/ruby_llm/stream_accumulator.rb +++ b/lib/ruby_llm/stream_accumulator.rb @@ -12,6 +12,8 @@ def initialize @tool_calls = {} @input_tokens = 0 @output_tokens = 0 + @cache_creation_input_tokens = 0 + @cache_read_input_tokens = 0 @latest_tool_call_id = nil end @@ -36,7 +38,9 @@ def to_message model_id: model_id, tool_calls: tool_calls_from_stream, input_tokens: @input_tokens.positive? ? @input_tokens : nil, - output_tokens: @output_tokens.positive? ? @output_tokens : nil + output_tokens: @output_tokens.positive? ? @output_tokens : nil, + cache_creation_input_tokens: @cache_creation_input_tokens.positive? ? @cache_creation_input_tokens : nil, + cache_read_input_tokens: @cache_read_input_tokens.positive? ? 
@cache_read_input_tokens : nil ) end @@ -91,6 +95,8 @@ def find_tool_call(tool_call_id) def count_tokens(chunk) @input_tokens += chunk.input_tokens if chunk.input_tokens @output_tokens += chunk.output_tokens if chunk.output_tokens + @cache_creation_input_tokens += chunk.cache_creation_input_tokens if chunk.cache_creation_input_tokens + @cache_read_input_tokens += chunk.cache_read_input_tokens if chunk.cache_read_input_tokens end end end diff --git a/spec/ruby_llm/anthropic_caching_spec.rb b/spec/ruby_llm/anthropic_caching_spec.rb new file mode 100644 index 000000000..8d087ac4e --- /dev/null +++ b/spec/ruby_llm/anthropic_caching_spec.rb @@ -0,0 +1,79 @@ +# frozen_string_literal: true + +require 'spec_helper' + +RSpec.describe 'Anthropic Prompt Caching' do + let(:model) { 'claude-3-5-sonnet-20240620' } + + it 'supports cache control in content' do + content = RubyLLM::Content.new('Hello, world!', cache_control: true) + expect(content.cache_control).to be true + end + + it 'formats content with cache control for Anthropic' do + content = RubyLLM::Content.new('Hello, world!', cache_control: true) + formatted = RubyLLM::Providers::Anthropic::Media.format_content([{ type: 'text', text: 'Hello, world!' }], true) + + expect(formatted.first[:cache_control]).to eq({ type: 'ephemeral' }) + end + + it 'includes cache control in message formatting' do + # Create a message with cache control + content = RubyLLM::Content.new('Hello, world!', cache_control: true) + message = RubyLLM::Message.new(role: :user, content: content) + + # Format the message using the Anthropic provider + formatted = RubyLLM::Providers::Anthropic.send(:format_basic_message, message) + + # Check that the formatted message includes cache control + expect(formatted[:content].first[:cache_control]).to eq({ type: 'ephemeral' }) + end + + it 'tracks cache-related token usage in messages' do + # Create a message with cache-related token usage + message = RubyLLM::Message.new( + role: :assistant, + content: 'Hello, world!', + cache_creation_input_tokens: 100, + cache_read_input_tokens: 0 + ) + + # Check that the message includes cache-related token usage + expect(message.cache_creation_input_tokens).to eq(100) + expect(message.cache_read_input_tokens).to eq(0) + + # Check that the hash representation includes cache-related token usage + expect(message.to_h[:cache_creation_input_tokens]).to eq(100) + expect(message.to_h[:cache_read_input_tokens]).to eq(0) + end + + it 'reports cache pricing information' do + # Get the input price for the model + input_price = RubyLLM::Providers::Anthropic::Capabilities.get_input_price(model) + + # Get the cache write price for the model + cache_write_price = RubyLLM::Providers::Anthropic::Capabilities.cache_write_price_for(model) + + # Get the cache hit price for the model + cache_hit_price = RubyLLM::Providers::Anthropic::Capabilities.cache_hit_price_for(model) + + # Check that the cache write price is 25% more than the input price + expect(cache_write_price).to eq(input_price * 1.25) + + # Check that the cache hit price is 90% less than the input price + expect(cache_hit_price).to eq(input_price * 0.1) + end + + it 'reports caching support for Claude models' do + # Check that caching is supported for Claude 3 models + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-3-opus-20240229')).to be true + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-3-sonnet-20240229')).to be true + 
expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-3-haiku-20240307')).to be true + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-3-5-sonnet-20240620')).to be true + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-3-7-sonnet-20250219')).to be true + + # Check that caching is not supported for Claude 2 models + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-2.0')).to be false + expect(RubyLLM::Providers::Anthropic::Capabilities.supports_caching?('claude-2.1')).to be false + end +end
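
A minimal end-to-end sketch of the caching flow introduced above, assuming the usual `RubyLLM.configure` setup with `config.anthropic_api_key` set and the same `claude-3-5-sonnet` model alias used in the guide; the document text and printed token counts are illustrative only:

```ruby
require 'ruby_llm'

# Assumes an Anthropic API key is available in the environment.
RubyLLM.configure do |config|
  config.anthropic_api_key = ENV.fetch('ANTHROPIC_API_KEY')
end

chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Static instructions are marked cacheable via the new cache_control: keyword.
chat.with_instructions('You are an AI assistant tasked with analyzing documents.',
                       cache_control: true)

# The large, reusable context is also marked cacheable; the first request
# writes the cached prefix and reports it via cache_creation_input_tokens.
first = chat.ask('Here is the full policy document: [long text...]',
                 with: { cache_control: true })
puts "Cache write tokens: #{first.cache_creation_input_tokens}"

# Follow-up questions reuse the cached prefix and report cache hits instead.
second = chat.ask('Summarize the termination clause.')
puts "Cache read tokens: #{second.cache_read_input_tokens}"
```

This mirrors the multi-turn example in the new guide; whether the second call actually reports a cache hit depends on the prompt exceeding the model's minimum cacheable length and arriving within the cache's five-minute lifetime.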