57 changes: 57 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.3.0] - 2025-11-13

### Added
- **Adaptive Per-Model Capability Detection** - Complete refactor replacing hardcoded patterns (#7)
- Automatically learns which parameters each `(provider, model)` combination supports
- Per-model capability caching with `CacheKey{BaseURL, Model}` structure
- Thread-safe in-memory cache protected by `sync.RWMutex`
- Debug logging for cache hits/misses visible with `-d` flag
- **Zero-Configuration Provider Compatibility**
- Works with any OpenAI-compatible provider without code changes
- Automatic retry mechanism with error-based detection
- Broad keyword matching for parameter error detection
- No status code restrictions (handles misconfigured providers)
- **OpenWebUI Support** - Native support for OpenWebUI/LiteLLM backends
- Automatically adapts to OpenWebUI's parameter quirks
- First request detection (~1-2s penalty), instant subsequent requests
- Tested with GPT-5 and GPT-4.1 models

### Changed
- **Removed ~100 lines of hardcoded model patterns**
- Deleted `IsReasoningModel()` function with gpt-5/o1/o2/o3/o4 patterns
- Deleted `FetchReasoningModels()` function and OpenRouter API calls
- Deleted `ReasoningModelCache` struct and related code
- Removed unused imports: `encoding/json`, `net/http` from config.go
- **Refactored capability detection system**
- Changed from per-provider to per-model caching
- Struct-based cache keys (zero collision risk vs string concatenation)
- `GetProviderCapabilities()` → `GetModelCapabilities()`
- `SetProviderCapabilities()` → `SetModelCapabilities()`
- `ShouldUseMaxCompletionTokens()` now uses per-model cache
- **Enhanced retry logic in handlers.go**
- `isMaxTokensParameterError()` uses broad keyword matching
- `retryWithoutMaxCompletionTokens()` caches per-model capabilities
- Applied to both streaming and non-streaming handlers
- Removed status code restrictions for better provider compatibility

### Removed
- Hardcoded reasoning model patterns (gpt-5*, o1*, o2*, o3*, o4*)
- OpenRouter reasoning models API integration
- Provider-specific hardcoding for Unknown provider type
- Unused configuration imports and dead code

### Technical Details
- **Cache Structure**: `map[CacheKey]*ModelCapabilities` where `CacheKey{BaseURL, Model}`
- **Detection Flow**: Try max_completion_tokens → Error → Retry → Cache result
- **Error Detection**: Broad keyword matching - a generic indicator (`parameter`, `unsupported`, or `invalid`) combined with one of our parameter names (`max_tokens`, `max_completion_tokens`)
- **Cache Scope**: In-memory, thread-safe, cleared on restart
- **Benefits**: Future-proof, zero user config, ~70 net lines removed

### Documentation
- Added "Adaptive Per-Model Detection" section to README.md with full implementation details
- Updated CLAUDE.md with comprehensive per-model caching documentation
- Cleaned up docs/ folder - removed planning artifacts and superseded documentation

### Philosophy
This release embodies the project philosophy: "Support all provider quirks automatically - never burden users with configurations they don't understand." The adaptive system eliminates special-casing and works with any current or future OpenAI-compatible provider.

## [1.2.0] - 2025-11-01

### Added
84 changes: 84 additions & 0 deletions CLAUDE.md
@@ -106,6 +106,90 @@ The `mapModel()` function in converter.go implements intelligent routing:

Override via environment variables to route to alternative models (Grok, Gemini, DeepSeek-R1, etc.).

### Adaptive Per-Model Capability Detection

**Core Philosophy**: Support all provider quirks automatically - never burden users with configuration they don't understand.

The proxy uses a fully adaptive system that automatically learns what parameters each model supports through error-based retry and caching. This eliminates ALL hardcoded model patterns (~100 lines removed in v1.3.0).

**How It Works:**

1. **First Request (Cache Miss)**:
- `ShouldUseMaxCompletionTokens()` checks cache for `CacheKey{BaseURL, Model}`
- Cache miss → defaults to trying `max_completion_tokens` (correct for reasoning models)
- If provider returns "unsupported parameter" error, `retryWithoutMaxCompletionTokens()` is called
- Retry succeeds → cache `{UsesMaxCompletionTokens: false}`
- Original request succeeds → cache `{UsesMaxCompletionTokens: true}`

2. **Subsequent Requests (Cache Hit)**:
- `ShouldUseMaxCompletionTokens()` returns cached value immediately
- No trial-and-error needed
- ~1-2 second first request penalty, instant thereafter

**Cache Structure** (`internal/config/config.go:29-48`):

```go
type CacheKey struct {
BaseURL string // Provider base URL (e.g., "https://gpt.erst.dk/api")
Model string // Model name (e.g., "gpt-5")
}

type ModelCapabilities struct {
UsesMaxCompletionTokens bool // Learned via adaptive retry
LastChecked time.Time // Timestamp
}

// Global cache: map[CacheKey]*ModelCapabilities
// Protected by sync.RWMutex for thread-safety
```
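
The accessor functions introduced in this release (`GetModelCapabilities()`, `SetModelCapabilities()`, `ShouldUseMaxCompletionTokens()`) are thin wrappers around this map. The sketch below shows one way they could be wired up; the global variable names and exact signatures are assumptions for illustration, not the actual `config.go` code (imports omitted, as in the snippets above).

```go
// Illustrative sketch - cache variable names and signatures are assumptions.
var (
	modelCaps   = make(map[CacheKey]*ModelCapabilities)
	modelCapsMu sync.RWMutex
)

func GetModelCapabilities(baseURL, model string) (*ModelCapabilities, bool) {
	modelCapsMu.RLock()
	defer modelCapsMu.RUnlock()
	caps, ok := modelCaps[CacheKey{BaseURL: baseURL, Model: model}]
	return caps, ok
}

func SetModelCapabilities(baseURL, model string, caps *ModelCapabilities) {
	modelCapsMu.Lock()
	defer modelCapsMu.Unlock()
	modelCaps[CacheKey{BaseURL: baseURL, Model: model}] = caps
}

// On a cache miss the proxy defaults to trying max_completion_tokens,
// which is the correct first guess for reasoning models.
func ShouldUseMaxCompletionTokens(baseURL, model string) bool {
	if caps, ok := GetModelCapabilities(baseURL, model); ok {
		return caps.UsesMaxCompletionTokens
	}
	return true
}
```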

**Error Detection** (`internal/server/handlers.go:895-913`):

```go
func isMaxTokensParameterError(errorMessage string) bool {
errorLower := strings.ToLower(errorMessage)

// Broad keyword matching (no status code restriction)
hasParamIndicator := strings.Contains(errorLower, "parameter") ||
strings.Contains(errorLower, "unsupported") ||
strings.Contains(errorLower, "invalid")

hasOurParam := strings.Contains(errorLower, "max_tokens") ||
strings.Contains(errorLower, "max_completion_tokens")

return hasParamIndicator && hasOurParam
}
```
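
When this check fires, `retryWithoutMaxCompletionTokens()` strips the parameter, re-sends the request once, and caches the result per model. The sketch below is a simplified illustration of that flow, not the actual handlers.go code - the request map, the `send` callback, and the omitted package qualifiers are assumptions.

```go
// Simplified sketch of the adaptive retry flow. The request shape and the
// send callback are assumptions, not the actual handlers.go signatures.
func sendWithAdaptiveRetry(baseURL, model string, body map[string]interface{},
	send func(map[string]interface{}) (respBody []byte, status int, err error)) ([]byte, int, error) {

	_, triedMaxCompletion := body["max_completion_tokens"]

	respBody, status, err := send(body)
	if err == nil && status < 400 {
		if triedMaxCompletion {
			// Original request succeeded: this model accepts max_completion_tokens.
			SetModelCapabilities(baseURL, model, &ModelCapabilities{
				UsesMaxCompletionTokens: true, LastChecked: time.Now()})
		}
		return respBody, status, err
	}
	if err != nil || !isMaxTokensParameterError(string(respBody)) {
		return respBody, status, err // unrelated failure: pass it through
	}

	// Provider rejected the parameter: fall back to max_tokens and retry once.
	if v, ok := body["max_completion_tokens"]; ok {
		delete(body, "max_completion_tokens")
		body["max_tokens"] = v
	}
	respBody, status, err = send(body)
	if err == nil && status < 400 {
		// Cache the learned quirk so later requests skip the retry entirely.
		SetModelCapabilities(baseURL, model, &ModelCapabilities{
			UsesMaxCompletionTokens: false, LastChecked: time.Now()})
	}
	return respBody, status, err
}
```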

**Debug Logging**:

Start proxy with `-d` flag to see cache activity:

```bash
./claude-code-proxy -d -s

# Console output shows:
[DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
[DEBUG] Cached: model gpt-5 supports max_completion_tokens (streaming)
[DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```

**Key Benefits**:

- **Future-proof**: Works with any new model/provider without code changes
- **Zero user config**: No need to know which parameters each provider supports
- **Per-model granularity**: Same model name on different providers cached separately
- **Thread-safe**: Protected by `sync.RWMutex` for concurrent requests
- **In-memory**: Cleared on restart (first request re-detects)

**What Was Removed** (v1.3.0):

- `IsReasoningModel()` function (30 lines) - checked for gpt-5/o1/o2/o3/o4 patterns
- `FetchReasoningModels()` function (56 lines) - OpenRouter API calls
- `ReasoningModelCache` struct (11 lines) - per-provider reasoning model lists
- Provider-specific hardcoding for Unknown provider type
- ~100 lines total removed, replaced with ~30 lines of adaptive detection

## Configuration System

Config loading priority (see `internal/config/config.go`):
80 changes: 80 additions & 0 deletions README.md
@@ -25,6 +25,10 @@ A lightweight HTTP proxy that enables Claude Code to work with OpenAI-compatible
- **OpenRouter**: 200+ models (GPT, Grok, Gemini, etc.) through single API
- **OpenAI Direct**: Native GPT-5 reasoning model support
- **Ollama**: Free local inference with DeepSeek-R1, Llama3, Qwen, etc.
- ✅ **Adaptive Per-Model Detection** - Zero-config provider compatibility
- Automatically learns which parameters each model supports
- No hardcoded model patterns - works with any future model/provider
- Per-model capability caching for instant subsequent requests
- ✅ **Pattern-based routing** - Auto-detects Claude models and routes to appropriate backend models
- ✅ **Zero dependencies** - Single ~10MB binary, no runtime needed
- ✅ **Daemon mode** - Runs in background, serves multiple Claude Code sessions
@@ -390,6 +394,82 @@ See [CLAUDE.md](CLAUDE.md#manual-testing) for detailed testing instructions incl
- Generates proper event sequence (message_start, content_block_start, deltas, etc.)
- Tracks content block indices for proper Claude Code rendering

## Adaptive Per-Model Detection

The proxy uses a fully adaptive system that automatically learns what parameters each model supports, eliminating the need for hardcoded model patterns or provider-specific configuration.

### How It Works

**Philosophy:** Support all provider quirks automatically - never burden users with configurations they don't understand.

1. **First Request** (Cache Miss):
```
[DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
```
- Proxy tries sending `max_completion_tokens` (correct for reasoning models)
- If provider returns "unsupported parameter" error, automatically retries without it
- Result is cached per `(provider, model)` combination

2. **Subsequent Requests** (Cache Hit):
```
[DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```
- Proxy uses cached knowledge immediately
- No trial-and-error needed
- Instant parameter selection
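
On a cache hit, parameter selection reduces to a simple branch when the proxy builds the upstream request. The snippet below is an illustrative sketch only - the payload map and variable names are assumptions, not the proxy's actual request-building code.

```go
// Illustrative sketch - payload map and variable names are assumptions.
if ShouldUseMaxCompletionTokens(baseURL, model) {
	payload["max_completion_tokens"] = maxTokens // reasoning-style models
} else {
	payload["max_tokens"] = maxTokens // models that rejected the parameter before
}
```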

### Benefits

- **Zero Configuration** - No need to know which parameters each provider supports
- **Future-Proof** - Works with any new model/provider without code changes
- **Fast** - Only 1-2 second penalty on first request, instant thereafter
- **Provider-Agnostic** - Automatically adapts to OpenRouter, OpenAI Direct, Ollama, OpenWebUI, or any OpenAI-compatible provider
- **Per-Model Granularity** - Same model name on different providers cached separately

### Cache Details

**What's Cached:**
```go
// Illustrative entry - "cache" stands in for the proxy's internal map
cache[CacheKey{
	BaseURL: "https://gpt.erst.dk/api", // Provider
	Model:   "gpt-5",                   // Model name
}] = &ModelCapabilities{
	UsesMaxCompletionTokens: false,      // Learned capability
	LastChecked:             time.Now(), // Timestamp
}
```

**Cache Scope:**
- In-memory only (cleared on proxy restart)
- Thread-safe (protected by `sync.RWMutex`)
- Per (provider, model) combination
- Visible in debug logs (`-d` flag)

### Example: OpenWebUI

When using OpenWebUI (which has a quirk with `max_completion_tokens`):

| Request | What Happens | Duration |
|---------|--------------|----------|
| 1st | Try max_completion_tokens → Error → Retry without it | ~2 seconds |
| 2nd+ | Use cached knowledge (no retry) | < 100ms |

**No configuration needed** - the proxy learns and adapts automatically.

### Debug Logging

Enable debug mode to see cache activity:

```bash
./claude-code-proxy -d -s

# Logs show:
# [DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
# [DEBUG] Cached: model gpt-5 supports max_completion_tokens
# [DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
```

## License

MIT
13 changes: 2 additions & 11 deletions cmd/claude-code-proxy/main.go
@@ -77,18 +77,9 @@ func main() {
 		os.Exit(1)
 	}
 
-	// Fetch reasoning models from OpenRouter (dynamic detection)
-	// This happens asynchronously and non-blocking - falls back to hardcoded patterns if it fails
-	go func() {
-		if err := cfg.FetchReasoningModels(); err != nil {
-			// Silent failure - hardcoded fallback will work
-			if cfg.Debug {
-				fmt.Printf("[DEBUG] Failed to fetch reasoning models from OpenRouter: %v\n", err)
-			}
-		}
-	}()
-
 	// Start HTTP server (blocks)
+	// Note: No need to pre-fetch reasoning models - adaptive per-model detection
+	// handles all models automatically through retry mechanism
 	if err := server.Start(cfg); err != nil {
 		fmt.Fprintf(os.Stderr, "Error starting server: %v\n", err)
 		os.Exit(1)