57 changes: 57 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.3.0] - 2025-11-13

### Added
- **Adaptive Per-Model Capability Detection** - Complete refactor replacing hardcoded patterns (#7)
- Automatically learns which parameters each `(provider, model)` combination supports
- Per-model capability caching with `CacheKey{BaseURL, Model}` structure
- Thread-safe in-memory cache protected by `sync.RWMutex`
- Debug logging for cache hits/misses visible with `-d` flag
- **Zero-Configuration Provider Compatibility**
- Works with any OpenAI-compatible provider without code changes
- Automatic retry mechanism with error-based detection
- Broad keyword matching for parameter error detection
- No status code restrictions (handles misconfigured providers)
- **OpenWebUI Support** - Native support for OpenWebUI/LiteLLM backends
- Automatically adapts to OpenWebUI's parameter quirks
- First request detection (~1-2s penalty), instant subsequent requests
- Tested with GPT-5 and GPT-4.1 models

### Changed
- **Removed ~100 lines of hardcoded model patterns**
- Deleted `IsReasoningModel()` function with gpt-5/o1/o2/o3/o4 patterns
- Deleted `FetchReasoningModels()` function and OpenRouter API calls
- Deleted `ReasoningModelCache` struct and related code
- Removed unused imports: `encoding/json`, `net/http` from config.go
- **Refactored capability detection system**
- Changed from per-provider to per-model caching
- Struct-based cache keys (zero collision risk vs string concatenation)
- `GetProviderCapabilities()` → `GetModelCapabilities()`
- `SetProviderCapabilities()` → `SetModelCapabilities()`
- `ShouldUseMaxCompletionTokens()` now uses per-model cache
- **Enhanced retry logic in handlers.go**
- `isMaxTokensParameterError()` uses broad keyword matching
- `retryWithoutMaxCompletionTokens()` caches per-model capabilities
- Applied to both streaming and non-streaming handlers
- Removed status code restrictions for better provider compatibility

### Removed
- Hardcoded reasoning model patterns (gpt-5*, o1*, o2*, o3*, o4*)
- OpenRouter reasoning models API integration
- Provider-specific hardcoding for Unknown provider type
- Unused configuration imports and dead code

### Technical Details
- **Cache Structure**: `map[CacheKey]*ModelCapabilities` where `CacheKey{BaseURL, Model}`
- **Detection Flow**: Try max_completion_tokens → Error → Retry → Cache result
- **Error Detection**: Broad keyword matching - a generic indicator (`parameter`, `unsupported`, or `invalid`) combined with one of our parameter names (`max_tokens`, `max_completion_tokens`)
- **Cache Scope**: In-memory, thread-safe, cleared on restart
- **Benefits**: Future-proof, zero user config, ~70 net lines removed

### Documentation
- Added "Adaptive Per-Model Detection" section to README.md with full implementation details
- Updated CLAUDE.md with comprehensive per-model caching documentation
- Cleaned up docs/ folder - removed planning artifacts and superseded documentation

### Philosophy
This release embodies the project philosophy: "Support all provider quirks automatically - never burden users with configurations they don't understand." The adaptive system eliminates special-casing and works with any current or future OpenAI-compatible provider.

## [1.2.0] - 2025-11-01

### Added
84 changes: 84 additions & 0 deletions CLAUDE.md
@@ -106,6 +106,90 @@ The `mapModel()` function in converter.go implements intelligent routing:

Override via environment variables to route to alternative models (Grok, Gemini, DeepSeek-R1, etc.).

### Adaptive Per-Model Capability Detection

**Core Philosophy**: Support all provider quirks automatically - never burden users with configuration they don't understand.

The proxy uses a fully adaptive system that automatically learns what parameters each model supports through error-based retry and caching. This eliminates ALL hardcoded model patterns (~100 lines removed in v1.3.0).

**How It Works:**

1. **First Request (Cache Miss)**:
- `ShouldUseMaxCompletionTokens()` checks cache for `CacheKey{BaseURL, Model}`
- Cache miss → defaults to trying `max_completion_tokens` (correct for reasoning models)
- If provider returns "unsupported parameter" error, `retryWithoutMaxCompletionTokens()` is called
- Retry succeeds → cache `{UsesMaxCompletionTokens: false}`
- Original request succeeds → cache `{UsesMaxCompletionTokens: true}`

2. **Subsequent Requests (Cache Hit)**:
- `ShouldUseMaxCompletionTokens()` returns cached value immediately
- No trial-and-error needed
- ~1-2 second first request penalty, instant thereafter

**Cache Structure** (`internal/config/config.go:29-48`):

```go
type CacheKey struct {
BaseURL string // Provider base URL (e.g., "https://gpt.erst.dk/api")
Model string // Model name (e.g., "gpt-5")
}

type ModelCapabilities struct {
UsesMaxCompletionTokens bool // Learned via adaptive retry
LastChecked time.Time // Timestamp
}

// Global cache: map[CacheKey]*ModelCapabilities
// Protected by sync.RWMutex for thread-safety
```
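
The accessor functions introduced in this release (`GetModelCapabilities()`, `SetModelCapabilities()`, `ShouldUseMaxCompletionTokens()`) are thin wrappers around this map. The sketch below shows one way they could be wired up; the global variable names and exact signatures are assumptions for illustration, not the actual `config.go` code (imports omitted, as in the snippets above).

```go
// Illustrative sketch - cache variable names and signatures are assumptions.
var (
	modelCaps   = make(map[CacheKey]*ModelCapabilities)
	modelCapsMu sync.RWMutex
)

func GetModelCapabilities(baseURL, model string) (*ModelCapabilities, bool) {
	modelCapsMu.RLock()
	defer modelCapsMu.RUnlock()
	caps, ok := modelCaps[CacheKey{BaseURL: baseURL, Model: model}]
	return caps, ok
}

func SetModelCapabilities(baseURL, model string, caps *ModelCapabilities) {
	modelCapsMu.Lock()
	defer modelCapsMu.Unlock()
	modelCaps[CacheKey{BaseURL: baseURL, Model: model}] = caps
}

// On a cache miss the proxy defaults to trying max_completion_tokens,
// which is the correct first guess for reasoning models.
func ShouldUseMaxCompletionTokens(baseURL, model string) bool {
	if caps, ok := GetModelCapabilities(baseURL, model); ok {
		return caps.UsesMaxCompletionTokens
	}
	return true
}
```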

**Error Detection** (`internal/server/handlers.go:895-913`):

```go
func isMaxTokensParameterError(errorMessage string) bool {
errorLower := strings.ToLower(errorMessage)

// Broad keyword matching (no status code restriction)
hasParamIndicator := strings.Contains(errorLower, "parameter") ||
strings.Contains(errorLower, "unsupported") ||
strings.Contains(errorLower, "invalid")

hasOurParam := strings.Contains(errorLower, "max_tokens") ||
strings.Contains(errorLower, "max_completion_tokens")

return hasParamIndicator && hasOurParam
}
```
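
When this check fires, `retryWithoutMaxCompletionTokens()` strips the parameter, re-sends the request once, and caches the result per model. The sketch below is a simplified illustration of that flow, not the actual handlers.go code - the request map, the `send` callback, and the omitted package qualifiers are assumptions.

```go
// Simplified sketch of the adaptive retry flow. The request shape and the
// send callback are assumptions, not the actual handlers.go signatures.
func sendWithAdaptiveRetry(baseURL, model string, body map[string]interface{},
	send func(map[string]interface{}) (respBody []byte, status int, err error)) ([]byte, int, error) {

	_, triedMaxCompletion := body["max_completion_tokens"]

	respBody, status, err := send(body)
	if err == nil && status < 400 {
		if triedMaxCompletion {
			// Original request succeeded: this model accepts max_completion_tokens.
			SetModelCapabilities(baseURL, model, &ModelCapabilities{
				UsesMaxCompletionTokens: true, LastChecked: time.Now()})
		}
		return respBody, status, err
	}
	if err != nil || !isMaxTokensParameterError(string(respBody)) {
		return respBody, status, err // unrelated failure: pass it through
	}

	// Provider rejected the parameter: fall back to max_tokens and retry once.
	if v, ok := body["max_completion_tokens"]; ok {
		delete(body, "max_completion_tokens")
		body["max_tokens"] = v
	}
	respBody, status, err = send(body)
	if err == nil && status < 400 {
		// Cache the learned quirk so later requests skip the retry entirely.
		SetModelCapabilities(baseURL, model, &ModelCapabilities{
			UsesMaxCompletionTokens: false, LastChecked: time.Now()})
	}
	return respBody, status, err
}
```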

**Debug Logging**:

Start proxy with `-d` flag to see cache activity:

```bash
./claude-code-proxy -d -s

# Console output shows:
[DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
[DEBUG] Cached: model gpt-5 supports max_completion_tokens (streaming)
[DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```

**Key Benefits**:

- **Future-proof**: Works with any new model/provider without code changes
- **Zero user config**: No need to know which parameters each provider supports
- **Per-model granularity**: Same model name on different providers cached separately
- **Thread-safe**: Protected by `sync.RWMutex` for concurrent requests
- **In-memory**: Cleared on restart (first request re-detects)

**What Was Removed** (v1.3.0):

- `IsReasoningModel()` function (30 lines) - checked for gpt-5/o1/o2/o3/o4 patterns
- `FetchReasoningModels()` function (56 lines) - OpenRouter API calls
- `ReasoningModelCache` struct (11 lines) - per-provider reasoning model lists
- Provider-specific hardcoding for Unknown provider type
- ~100 lines total removed, replaced with ~30 lines of adaptive detection

## Configuration System

Config loading priority (see `internal/config/config.go`):
80 changes: 80 additions & 0 deletions README.md
@@ -25,6 +25,10 @@ A lightweight HTTP proxy that enables Claude Code to work with OpenAI-compatible
- **OpenRouter**: 200+ models (GPT, Grok, Gemini, etc.) through single API
- **OpenAI Direct**: Native GPT-5 reasoning model support
- **Ollama**: Free local inference with DeepSeek-R1, Llama3, Qwen, etc.
- ✅ **Adaptive Per-Model Detection** - Zero-config provider compatibility
- Automatically learns which parameters each model supports
- No hardcoded model patterns - works with any future model/provider
- Per-model capability caching for instant subsequent requests
- ✅ **Pattern-based routing** - Auto-detects Claude models and routes to appropriate backend models
- ✅ **Zero dependencies** - Single ~10MB binary, no runtime needed
- ✅ **Daemon mode** - Runs in background, serves multiple Claude Code sessions
@@ -390,6 +394,82 @@ See [CLAUDE.md](CLAUDE.md#manual-testing) for detailed testing instructions incl
- Generates proper event sequence (message_start, content_block_start, deltas, etc.)
- Tracks content block indices for proper Claude Code rendering

## Adaptive Per-Model Detection

The proxy uses a fully adaptive system that automatically learns what parameters each model supports, eliminating the need for hardcoded model patterns or provider-specific configuration.

### How It Works

**Philosophy:** Support all provider quirks automatically - never burden users with configurations they don't understand.

1. **First Request** (Cache Miss):
```
[DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
```
- Proxy tries sending `max_completion_tokens` (correct for reasoning models)
- If provider returns "unsupported parameter" error, automatically retries without it
- Result is cached per `(provider, model)` combination

2. **Subsequent Requests** (Cache Hit):
```
[DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```
- Proxy uses cached knowledge immediately
- No trial-and-error needed
- Instant parameter selection
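
On a cache hit, parameter selection reduces to a simple branch when the proxy builds the upstream request. The snippet below is an illustrative sketch only - the payload map and variable names are assumptions, not the proxy's actual request-building code.

```go
// Illustrative sketch - payload map and variable names are assumptions.
if ShouldUseMaxCompletionTokens(baseURL, model) {
	payload["max_completion_tokens"] = maxTokens // reasoning-style models
} else {
	payload["max_tokens"] = maxTokens // models that rejected the parameter before
}
```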

### Benefits

- **Zero Configuration** - No need to know which parameters each provider supports
- **Future-Proof** - Works with any new model/provider without code changes
- **Fast** - Only 1-2 second penalty on first request, instant thereafter
- **Provider-Agnostic** - Automatically adapts to OpenRouter, OpenAI Direct, Ollama, OpenWebUI, or any OpenAI-compatible provider
- **Per-Model Granularity** - Same model name on different providers cached separately

### Cache Details

**What's Cached:**
```go
// Illustrative entry - "cache" stands in for the proxy's internal map
cache[CacheKey{
	BaseURL: "https://gpt.erst.dk/api", // Provider
	Model:   "gpt-5",                   // Model name
}] = &ModelCapabilities{
	UsesMaxCompletionTokens: false,      // Learned capability
	LastChecked:             time.Now(), // Timestamp
}
```

**Cache Scope:**
- In-memory only (cleared on proxy restart)
- Thread-safe (protected by `sync.RWMutex`)
- Per (provider, model) combination
- Visible in debug logs (`-d` flag)

### Example: OpenWebUI

When using OpenWebUI (which has a quirk with `max_completion_tokens`):

| Request | What Happens | Duration |
|---------|--------------|----------|
| 1st | Try max_completion_tokens → Error → Retry without it | ~2 seconds |
| 2nd+ | Use cached knowledge (no retry) | < 100ms |

**No configuration needed** - the proxy learns and adapts automatically.

### Debug Logging

Enable debug mode to see cache activity:

```bash
./claude-code-proxy -d -s

# Logs show:
# [DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
# [DEBUG] Cached: model gpt-5 supports max_completion_tokens
# [DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```

## License

MIT
13 changes: 2 additions & 11 deletions cmd/claude-code-proxy/main.go
@@ -77,18 +77,9 @@ func main() {
 		os.Exit(1)
 	}
 
-	// Fetch reasoning models from OpenRouter (dynamic detection)
-	// This happens asynchronously and non-blocking - falls back to hardcoded patterns if it fails
-	go func() {
-		if err := cfg.FetchReasoningModels(); err != nil {
-			// Silent failure - hardcoded fallback will work
-			if cfg.Debug {
-				fmt.Printf("[DEBUG] Failed to fetch reasoning models from OpenRouter: %v\n", err)
-			}
-		}
-	}()
-
 	// Start HTTP server (blocks)
+	// Note: No need to pre-fetch reasoning models - adaptive per-model detection
+	// handles all models automatically through retry mechanism
 	if err := server.Start(cfg); err != nil {
 		fmt.Fprintf(os.Stderr, "Error starting server: %v\n", err)
 		os.Exit(1)