diff --git a/CHANGELOG.md b/CHANGELOG.md
index 784bd36..5d009de 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [1.3.0] - 2025-11-13
+
+### Added
+- **Adaptive Per-Model Capability Detection** - Complete refactor replacing hardcoded patterns (#7)
+  - Automatically learns which parameters each `(provider, model)` combination supports
+  - Per-model capability caching with `CacheKey{BaseURL, Model}` structure
+  - Thread-safe in-memory cache protected by `sync.RWMutex`
+  - Debug logging for cache hits/misses visible with `-d` flag
+- **Zero-Configuration Provider Compatibility**
+  - Works with any OpenAI-compatible provider without code changes
+  - Automatic retry mechanism with error-based detection
+  - Broad keyword matching for parameter error detection
+  - No status code restrictions (handles misconfigured providers)
+- **OpenWebUI Support** - Native support for OpenWebUI/LiteLLM backends
+  - Automatically adapts to OpenWebUI's parameter quirks
+  - First request detection (~1-2s penalty), instant subsequent requests
+  - Tested with GPT-5 and GPT-4.1 models
+
+### Changed
+- **Removed ~100 lines of hardcoded model patterns**
+  - Deleted `IsReasoningModel()` function with gpt-5/o1/o2/o3/o4 patterns
+  - Deleted `FetchReasoningModels()` function and OpenRouter API calls
+  - Deleted `ReasoningModelCache` struct and related code
+  - Removed unused imports: `encoding/json`, `net/http` from config.go
+- **Refactored capability detection system**
+  - Changed from per-provider to per-model caching
+  - Struct-based cache keys (zero collision risk vs string concatenation)
+  - `GetProviderCapabilities()` → `GetModelCapabilities()`
+  - `SetProviderCapabilities()` → `SetModelCapabilities()`
+  - `ShouldUseMaxCompletionTokens()` now uses per-model cache
+- **Enhanced retry logic in handlers.go**
+  - `isMaxTokensParameterError()` uses broad keyword matching
+  - `retryWithoutMaxCompletionTokens()` caches per-model capabilities
+  - Applied to both streaming and non-streaming handlers
+  - Removed status code restrictions for better provider compatibility
+
+### Removed
+- Hardcoded reasoning model patterns (gpt-5*, o1*, o2*, o3*, o4*)
+- OpenRouter reasoning models API integration
+- Provider-specific hardcoding for Unknown provider type
+- Unused configuration imports and dead code
+
+### Technical Details
+- **Cache Structure**: `map[CacheKey]*ModelCapabilities` where `CacheKey{BaseURL, Model}`
+- **Detection Flow**: Try max_completion_tokens → Error → Retry → Cache result
+- **Error Detection**: Broad keyword matching (parameter + unsupported/invalid) + our param names
+- **Cache Scope**: In-memory, thread-safe, cleared on restart
+- **Benefits**: Future-proof, zero user config, ~70 net lines removed
+
+### Documentation
+- Added "Adaptive Per-Model Detection" section to README.md with full implementation details
+- Updated CLAUDE.md with comprehensive per-model caching documentation
+- Cleaned up docs/ folder - removed planning artifacts and superseded documentation
+
+### Philosophy
+This release embodies the project philosophy: "Support all provider quirks automatically - never burden users with configurations they don't understand." The adaptive system eliminates special-casing and works with any current or future OpenAI-compatible provider.
+
 ## [1.2.0] - 2025-11-01
 
 ### Added
diff --git a/CLAUDE.md b/CLAUDE.md
index 69dcfcc..a73fc4e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -106,6 +106,90 @@ The `mapModel()` function in converter.go implements intelligent routing:
 
 Override via environment variables to route to alternative models (Grok, Gemini, DeepSeek-R1, etc.).
 
+###  Adaptive Per-Model Capability Detection
+
+**Core Philosophy**: Support all provider quirks automatically - never burden users with advance configs.
+
+The proxy uses a fully adaptive system that automatically learns what parameters each model supports through error-based retry and caching. This eliminates ALL hardcoded model patterns (~100 lines removed in v1.3.0).
+
+**How It Works:**
+
+1. **First Request (Cache Miss)**:
+   - `ShouldUseMaxCompletionTokens()` checks cache for `CacheKey{BaseURL, Model}`
+   - Cache miss → defaults to trying `max_completion_tokens` (correct for reasoning models)
+   - If provider returns "unsupported parameter" error, `retryWithoutMaxCompletionTokens()` is called
+   - Retry succeeds → cache `{UsesMaxCompletionTokens: false}`
+   - Original request succeeds → cache `{UsesMaxCompletionTokens: true}`
+
+2. **Subsequent Requests (Cache Hit)**:
+   - `ShouldUseMaxCompletionTokens()` returns cached value immediately
+   - No trial-and-error needed
+   - ~1-2 second first request penalty, instant thereafter
+
+**Cache Structure** (`internal/config/config.go:29-48`):
+
+```go
+type CacheKey struct {
+    BaseURL string  // Provider base URL (e.g., "https://gpt.erst.dk/api")
+    Model   string  // Model name (e.g., "gpt-5")
+}
+
+type ModelCapabilities struct {
+    UsesMaxCompletionTokens bool      // Learned via adaptive retry
+    LastChecked             time.Time // Timestamp
+}
+
+// Global cache: map[CacheKey]*ModelCapabilities
+// Protected by sync.RWMutex for thread-safety
+```
+
+**Error Detection** (`internal/server/handlers.go:895-913`):
+
+```go
+func isMaxTokensParameterError(errorMessage string) bool {
+    errorLower := strings.ToLower(errorMessage)
+
+    // Broad keyword matching (no status code restriction)
+    hasParamIndicator := strings.Contains(errorLower, "parameter") ||
+                        strings.Contains(errorLower, "unsupported") ||
+                        strings.Contains(errorLower, "invalid")
+
+    hasOurParam := strings.Contains(errorLower, "max_tokens") ||
+                   strings.Contains(errorLower, "max_completion_tokens")
+
+    return hasParamIndicator && hasOurParam
+}
+```
+
+**Debug Logging**:
+
+Start proxy with `-d` flag to see cache activity:
+
+```bash
+./claude-code-proxy -d -s
+
+# Console output shows:
+[DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
+[DEBUG] Cached: model gpt-5 supports max_completion_tokens (streaming)
+[DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
+```
+
+**Key Benefits**:
+
+- **Future-proof**: Works with any new model/provider without code changes
+- **Zero user config**: No need to know which parameters each provider supports
+- **Per-model granularity**: Same model name on different providers cached separately
+- **Thread-safe**: Protected by `sync.RWMutex` for concurrent requests
+- **In-memory**: Cleared on restart (first request re-detects)
+
+**What Was Removed** (v1.3.0):
+
+- `IsReasoningModel()` function (30 lines) - checked for gpt-5/o1/o2/o3/o4 patterns
+- `FetchReasoningModels()` function (56 lines) - OpenRouter API calls
+- `ReasoningModelCache` struct (11 lines) - per-provider reasoning model lists
+- Provider-specific hardcoding for Unknown provider type
+- ~100 lines total removed, replaced with ~30 lines of adaptive detection
+
 ## Configuration System
 
 Config loading priority (see `internal/config/config.go`):
diff --git a/README.md b/README.md
index fce66c7..9b1c45f 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,10 @@ A lightweight HTTP proxy that enables Claude Code to work with OpenAI-compatible
   - **OpenRouter**: 200+ models (GPT, Grok, Gemini, etc.) through single API
   - **OpenAI Direct**: Native GPT-5 reasoning model support
   - **Ollama**: Free local inference with DeepSeek-R1, Llama3, Qwen, etc.
+- ✅ **Adaptive Per-Model Detection** - Zero-config provider compatibility
+  - Automatically learns which parameters each model supports
+  - No hardcoded model patterns - works with any future model/provider
+  - Per-model capability caching for instant subsequent requests
 - ✅ **Pattern-based routing** - Auto-detects Claude models and routes to appropriate backend models
 - ✅ **Zero dependencies** - Single ~10MB binary, no runtime needed
 - ✅ **Daemon mode** - Runs in background, serves multiple Claude Code sessions
@@ -390,6 +394,82 @@ See [CLAUDE.md](CLAUDE.md#manual-testing) for detailed testing instructions incl
    - Generates proper event sequence (message_start, content_block_start, deltas, etc.)
    - Tracks content block indices for proper Claude Code rendering
 
+## Adaptive Per-Model Detection
+
+The proxy uses a fully adaptive system that automatically learns what parameters each model supports, eliminating the need for hardcoded model patterns or provider-specific configuration.
+
+### How It Works
+
+**Philosophy:** Support all provider quirks automatically - never burden users with configurations they don't understand.
+
+1. **First Request** (Cache Miss):
+   ```
+   [DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
+   ```
+   - Proxy tries sending `max_completion_tokens` (correct for reasoning models)
+   - If provider returns "unsupported parameter" error, automatically retries without it
+   - Result is cached per `(provider, model)` combination
+
+2. **Subsequent Requests** (Cache Hit):
+   ```
+   [DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
+   ```
+   - Proxy uses cached knowledge immediately
+   - No trial-and-error needed
+   - Instant parameter selection
+
+### Benefits
+
+- **Zero Configuration** - No need to know which parameters each provider supports
+- **Future-Proof** - Works with any new model/provider without code changes
+- **Fast** - Only 1-2 second penalty on first request, instant thereafter
+- **Provider-Agnostic** - Automatically adapts to OpenRouter, OpenAI Direct, Ollama, OpenWebUI, or any OpenAI-compatible provider
+- **Per-Model Granularity** - Same model name on different providers cached separately
+
+### Cache Details
+
+**What's Cached:**
+```go
+CacheKey{
+    BaseURL: "https://gpt.erst.dk/api",  // Provider
+    Model:   "gpt-5"                      // Model name
+}
+→ ModelCapabilities{
+    UsesMaxCompletionTokens: false,       // Learned capability
+    LastChecked:             time.Now()   // Timestamp
+}
+```
+
+**Cache Scope:**
+- In-memory only (cleared on proxy restart)
+- Thread-safe (protected by `sync.RWMutex`)
+- Per (provider, model) combination
+- Visible in debug logs (`-d` flag)
+
+### Example: OpenWebUI
+
+When using OpenWebUI (which has a quirk with `max_completion_tokens`):
+
+| Request | What Happens | Duration |
+|---------|--------------|----------|
+| 1st | Try max_completion_tokens → Error → Retry without it | ~2 seconds |
+| 2nd+ | Use cached knowledge (no retry) | < 100ms |
+
+**No configuration needed** - the proxy learns and adapts automatically.
+
+### Debug Logging
+
+Enable debug mode to see cache activity:
+
+```bash
+./claude-code-proxy -d -s
+
+# Logs show:
+# [DEBUG] Cache MISS: gpt-5 → will auto-detect (try max_completion_tokens)
+# [DEBUG] Cached: model gpt-5 supports max_completion_tokens
+# [DEBUG] Cache HIT: gpt-5 → max_completion_tokens=true
+```
+
 ## License
 
 MIT
diff --git a/cmd/claude-code-proxy/main.go b/cmd/claude-code-proxy/main.go
index 9114c34..9821584 100644
--- a/cmd/claude-code-proxy/main.go
+++ b/cmd/claude-code-proxy/main.go
@@ -77,18 +77,9 @@ func main() {
 		os.Exit(1)
 	}
 
-	// Fetch reasoning models from OpenRouter (dynamic detection)
-	// This happens asynchronously and non-blocking - falls back to hardcoded patterns if it fails
-	go func() {
-		if err := cfg.FetchReasoningModels(); err != nil {
-			// Silent failure - hardcoded fallback will work
-			if cfg.Debug {
-				fmt.Printf("[DEBUG] Failed to fetch reasoning models from OpenRouter: %v\n", err)
-			}
-		}
-	}()
-
 	// Start HTTP server (blocks)
+	// Note: No need to pre-fetch reasoning models - adaptive per-model detection
+	// handles all models automatically through retry mechanism
 	if err := server.Start(cfg); err != nil {
 		fmt.Fprintf(os.Stderr, "Error starting server: %v\n", err)
 		os.Exit(1)
diff --git a/internal/config/config.go b/internal/config/config.go
index 52e1b3c..1043d94 100644
--- a/internal/config/config.go
+++ b/internal/config/config.go
@@ -6,12 +6,11 @@
 package config
 
 import (
-	"encoding/json"
 	"fmt"
-	"net/http"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"time"
 
 	"github.com/joho/godotenv"
@@ -27,6 +26,27 @@ const (
 	ProviderUnknown    ProviderType = "unknown"
 )
 
+// CacheKey uniquely identifies a (provider, model) combination for capability caching
+// Using a struct as map key provides type safety and zero collision risk
+type CacheKey struct {
+	BaseURL string // Provider base URL (e.g., "https://openrouter.ai/api/v1")
+	Model   string // Model name (e.g., "gpt-5", "openai/gpt-5")
+}
+
+// ModelCapabilities tracks which parameters a specific model supports
+// This is learned dynamically through adaptive retry mechanism
+type ModelCapabilities struct {
+	UsesMaxCompletionTokens bool      // Does this model use max_completion_tokens?
+	LastChecked             time.Time // When was this last verified?
+}
+
+// Global capability cache ((baseURL, model) -> capabilities)
+// Protected by mutex for thread-safe access across concurrent requests
+var (
+	modelCapabilityCache = make(map[CacheKey]*ModelCapabilities)
+	capabilityCacheMutex sync.RWMutex
+)
+
 // Config holds all proxy configuration
 type Config struct {
 	// Required
@@ -162,102 +182,52 @@ func (c *Config) IsLocalhost() bool {
 	return strings.Contains(baseURL, "localhost") || strings.Contains(baseURL, "127.0.0.1")
 }
 
-// ReasoningModelCache stores which models support reasoning capabilities.
-// This is fetched from OpenRouter's API on startup to avoid hardcoding model names.
-type ReasoningModelCache struct {
-	models    map[string]bool // model ID -> supports reasoning
-	populated bool
-}
 
-// Global cache instance
-var reasoningCache = &ReasoningModelCache{
-	models: make(map[string]bool),
+// GetModelCapabilities retrieves cached capabilities for a (provider, model) combination.
+// Returns nil if no capabilities are cached yet (first request for this model).
+// Thread-safe with read lock.
+func GetModelCapabilities(key CacheKey) *ModelCapabilities {
+	capabilityCacheMutex.RLock()
+	defer capabilityCacheMutex.RUnlock()
+	return modelCapabilityCache[key]
 }
 
-// IsReasoningModel checks if a model supports reasoning capabilities.
-// For OpenRouter, this uses the cached API data. Otherwise falls back to pattern matching.
-func (c *Config) IsReasoningModel(modelName string) bool {
-	// For OpenRouter: use cached data if available
-	if c.DetectProvider() == ProviderOpenRouter && reasoningCache.populated {
-		if isReasoning, found := reasoningCache.models[modelName]; found {
-			return isReasoning
-		}
-	}
-
-	// Fallback to hardcoded pattern matching (OpenAI Direct, Ollama, or cache miss)
-	model := strings.ToLower(modelName)
-	model = strings.TrimPrefix(model, "azure/")
-	model = strings.TrimPrefix(model, "openai/")
-
-	// Check for o-series reasoning models (o1, o2, o3, o4, etc.)
-	if strings.HasPrefix(model, "o1") ||
-		strings.HasPrefix(model, "o2") ||
-		strings.HasPrefix(model, "o3") ||
-		strings.HasPrefix(model, "o4") {
-		return true
-	}
-
-	// Check for GPT-5 series (gpt-5, gpt-5-mini, gpt-5-turbo, etc.)
-	if strings.HasPrefix(model, "gpt-5") {
-		return true
-	}
-
-	return false
+// SetModelCapabilities caches the capabilities for a (provider, model) combination.
+// This is called after detecting what parameters a specific model supports through adaptive retry.
+// Thread-safe with write lock.
+func SetModelCapabilities(key CacheKey, capabilities *ModelCapabilities) {
+	capabilityCacheMutex.Lock()
+	defer capabilityCacheMutex.Unlock()
+	capabilities.LastChecked = time.Now()
+	modelCapabilityCache[key] = capabilities
 }
 
-// FetchReasoningModels fetches the list of reasoning-capable models from OpenRouter's API.
-// This is called on startup to dynamically detect models that support reasoning,
-// avoiding the need to hardcode model names like deepseek-r1, etc.
-// No authentication required for this endpoint.
-func (c *Config) FetchReasoningModels() error {
-	// Only fetch for OpenRouter
-	if c.DetectProvider() != ProviderOpenRouter {
-		return nil
-	}
-
-	// Create HTTP client with timeout
-	client := &http.Client{
-		Timeout: 10 * time.Second,
-	}
-
-	// OpenRouter provides a filtered endpoint for reasoning models
-	req, err := http.NewRequest("GET", "https://openrouter.ai/api/v1/models?supported_parameters=reasoning", nil)
-	if err != nil {
-		return fmt.Errorf("failed to create request: %w", err)
-	}
-
-	resp, err := client.Do(req)
-	if err != nil {
-		return fmt.Errorf("failed to fetch reasoning models: %w", err)
-	}
-	defer func() {
-		_ = resp.Body.Close()
-	}()
-
-	if resp.StatusCode != http.StatusOK {
-		return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
-	}
-
-	// Parse response
-	var result struct {
-		Data []struct {
-			ID string `json:"id"`
-		} `json:"data"`
-	}
-
-	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
-		return fmt.Errorf("failed to decode response: %w", err)
-	}
-
-	// Populate cache
-	for _, model := range result.Data {
-		reasoningCache.models[model.ID] = true
+// ShouldUseMaxCompletionTokens determines if we should send max_completion_tokens
+// based on cached model capabilities learned through adaptive detection.
+// No hardcoded model patterns - tries max_completion_tokens for ALL models on first request.
+func (c *Config) ShouldUseMaxCompletionTokens(modelName string) bool {
+	// Build cache key for this (provider, model) combination
+	key := CacheKey{
+		BaseURL: c.OpenAIBaseURL,
+		Model:   modelName,
+	}
+
+	// Check if we have cached knowledge about this specific model
+	caps := GetModelCapabilities(key)
+	if caps != nil {
+		// Cache hit - use learned capability
+		if c.Debug {
+			fmt.Printf("[DEBUG] Cache HIT: %s → max_completion_tokens=%v\n",
+				modelName, caps.UsesMaxCompletionTokens)
+		}
+		return caps.UsesMaxCompletionTokens
 	}
-	reasoningCache.populated = true
 
+	// Cache miss - default to trying max_completion_tokens first
+	// The retry mechanism in handlers.go will detect if it's not supported
+	// and automatically fall back to max_tokens, then cache the result
 	if c.Debug {
-		fmt.Printf("[DEBUG] Cached %d reasoning models from OpenRouter\n", len(result.Data))
+		fmt.Printf("[DEBUG] Cache MISS: %s → will auto-detect (try max_completion_tokens)\n", modelName)
 	}
-
-	return nil
+	return true
 }
diff --git a/internal/config/config_test.go b/internal/config/config_test.go
index acc9f82..41ab2c7 100644
--- a/internal/config/config_test.go
+++ b/internal/config/config_test.go
@@ -1,12 +1,8 @@
 package config
 
 import (
-	"encoding/json"
-	"net/http"
-	"net/http/httptest"
 	"os"
 	"path/filepath"
-	"strings"
 	"testing"
 )
 
@@ -577,461 +573,3 @@ func TestMultipleEnvFiles(t *testing.T) {
 		t.Errorf("Expected local base URL, got %q", cfg.OpenAIBaseURL)
 	}
 }
-
-// TestIsReasoningModelWithHardcodedFallback tests reasoning model detection using hardcoded patterns
-func TestIsReasoningModelWithHardcodedFallback(t *testing.T) {
-	tests := []struct {
-		name             string
-		model            string
-		baseURL          string
-		populateCache    bool
-		expectedReasoning bool
-	}{
-		// OpenAI o-series models (hardcoded fallback)
-		{"o1 model", "o1", "https://api.openai.com/v1", false, true},
-		{"o1-preview model", "o1-preview", "https://api.openai.com/v1", false, true},
-		{"o2 model", "o2", "https://api.openai.com/v1", false, true},
-		{"o3 model", "o3", "https://api.openai.com/v1", false, true},
-		{"o3-mini model", "o3-mini", "https://api.openai.com/v1", false, true},
-		{"o4 model", "o4", "https://api.openai.com/v1", false, true},
-
-		// GPT-5 series models (hardcoded fallback)
-		{"gpt-5 model", "gpt-5", "https://api.openai.com/v1", false, true},
-		{"gpt-5-mini model", "gpt-5-mini", "https://api.openai.com/v1", false, true},
-		{"gpt-5-turbo model", "gpt-5-turbo", "https://api.openai.com/v1", false, true},
-
-		// Azure variants with provider prefix
-		{"azure/o1 model", "azure/o1", "https://azure.openai.com/v1", false, true},
-		{"azure/gpt-5 model", "azure/gpt-5", "https://azure.openai.com/v1", false, true},
-		{"openai/o3 model", "openai/o3", "https://api.openai.com/v1", false, true},
-		{"openai/gpt-5 model", "openai/gpt-5", "https://api.openai.com/v1", false, true},
-
-		// Non-reasoning models
-		{"gpt-4o model", "gpt-4o", "https://api.openai.com/v1", false, false},
-		{"gpt-4-turbo model", "gpt-4-turbo", "https://api.openai.com/v1", false, false},
-		{"gpt-3.5-turbo model", "gpt-3.5-turbo", "https://api.openai.com/v1", false, false},
-		{"claude-sonnet model", "claude-sonnet-4", "https://api.openai.com/v1", false, false},
-
-		// Edge cases
-		{"empty string", "", "https://api.openai.com/v1", false, false},
-		{"ollama prefix", "ollama", "http://localhost:11434/v1", false, false},
-		{"contains o but not o-series", "anthropic", "https://api.openai.com/v1", false, false},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			cfg := &Config{
-				OpenAIBaseURL: tt.baseURL,
-			}
-
-			// Clear cache to test hardcoded fallback
-			reasoningCache = &ReasoningModelCache{
-				models:    make(map[string]bool),
-				populated: false,
-			}
-
-			result := cfg.IsReasoningModel(tt.model)
-			if result != tt.expectedReasoning {
-				t.Errorf("IsReasoningModel(%q) = %v, expected %v", tt.model, result, tt.expectedReasoning)
-			}
-		})
-	}
-}
-
-// TestIsReasoningModelWithCache tests reasoning model detection using cached OpenRouter data
-func TestIsReasoningModelWithCache(t *testing.T) {
-	// Setup mock cache data
-	mockCache := &ReasoningModelCache{
-		models: map[string]bool{
-			"openai/gpt-5":               true,
-			"google/gemini-2.5-flash":    true,
-			"deepseek/deepseek-r1":       true,
-			"nvidia/nemotron-nano-12b":   true,
-			"anthropic/claude-sonnet-4":  false, // Not in cache
-		},
-		populated: true,
-	}
-
-	tests := []struct {
-		name              string
-		model             string
-		baseURL           string
-		expectedReasoning bool
-	}{
-		// Models in cache
-		{"gpt-5 in cache", "openai/gpt-5", "https://openrouter.ai/api/v1", true},
-		{"gemini in cache", "google/gemini-2.5-flash", "https://openrouter.ai/api/v1", true},
-		{"deepseek-r1 in cache", "deepseek/deepseek-r1", "https://openrouter.ai/api/v1", true},
-		{"nvidia in cache", "nvidia/nemotron-nano-12b", "https://openrouter.ai/api/v1", true},
-
-		// Models not in cache - should fall back to hardcoded patterns
-		{"gpt-5 not cached but matches pattern", "gpt-5", "https://openrouter.ai/api/v1", true},
-		{"o3 not cached but matches pattern", "o3", "https://openrouter.ai/api/v1", true},
-		{"gpt-4o not cached and no pattern", "gpt-4o", "https://openrouter.ai/api/v1", false},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			// Set mock cache
-			reasoningCache = mockCache
-
-			cfg := &Config{
-				OpenAIBaseURL: tt.baseURL,
-			}
-
-			result := cfg.IsReasoningModel(tt.model)
-			if result != tt.expectedReasoning {
-				t.Errorf("IsReasoningModel(%q) with cache = %v, expected %v", tt.model, result, tt.expectedReasoning)
-			}
-		})
-	}
-
-	// Cleanup
-	reasoningCache = &ReasoningModelCache{
-		models:    make(map[string]bool),
-		populated: false,
-	}
-}
-
-// TestIsReasoningModelProviderSpecific tests that different providers use appropriate detection
-func TestIsReasoningModelProviderSpecific(t *testing.T) {
-	tests := []struct {
-		name              string
-		model             string
-		baseURL           string
-		provider          ProviderType
-		shouldUseCache    bool
-		expectedReasoning bool
-	}{
-		{
-			name:              "OpenRouter uses cache when populated",
-			model:             "google/gemini-2.5-flash",
-			baseURL:           "https://openrouter.ai/api/v1",
-			provider:          ProviderOpenRouter,
-			shouldUseCache:    true,
-			expectedReasoning: true,
-		},
-		{
-			name:              "OpenAI Direct uses hardcoded patterns",
-			model:             "gpt-5",
-			baseURL:           "https://api.openai.com/v1",
-			provider:          ProviderOpenAI,
-			shouldUseCache:    false,
-			expectedReasoning: true,
-		},
-		{
-			name:              "Ollama uses hardcoded patterns",
-			model:             "o1",
-			baseURL:           "http://localhost:11434/v1",
-			provider:          ProviderOllama,
-			shouldUseCache:    false,
-			expectedReasoning: true,
-		},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			cfg := &Config{
-				OpenAIBaseURL: tt.baseURL,
-			}
-
-			// Setup cache for OpenRouter test
-			if tt.shouldUseCache {
-				reasoningCache = &ReasoningModelCache{
-					models: map[string]bool{
-						"google/gemini-2.5-flash": true,
-					},
-					populated: true,
-				}
-			} else {
-				reasoningCache = &ReasoningModelCache{
-					models:    make(map[string]bool),
-					populated: false,
-				}
-			}
-
-			result := cfg.IsReasoningModel(tt.model)
-			if result != tt.expectedReasoning {
-				t.Errorf("IsReasoningModel(%q) for %v = %v, expected %v",
-					tt.model, tt.provider, result, tt.expectedReasoning)
-			}
-		})
-	}
-
-	// Cleanup
-	reasoningCache = &ReasoningModelCache{
-		models:    make(map[string]bool),
-		populated: false,
-	}
-}
-
-// TestFetchReasoningModels tests the dynamic reasoning model detection from OpenRouter API
-func TestFetchReasoningModels(t *testing.T) {
-	// Helper function to create mock OpenRouter API server
-	createMockServer := func(statusCode int, response string) *httptest.Server {
-		return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			// Verify the request is for reasoning models
-			if !strings.Contains(r.URL.String(), "supported_parameters=reasoning") {
-				t.Errorf("Expected URL to contain 'supported_parameters=reasoning', got %q", r.URL.String())
-			}
-
-			w.WriteHeader(statusCode)
-			w.Header().Set("Content-Type", "application/json")
-			w.Write([]byte(response))
-		}))
-	}
-
-	t.Run("successful fetch and cache population", func(t *testing.T) {
-		// Clear cache
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-
-		// Create mock response matching OpenRouter's actual format
-		mockResponse := `{
-			"data": [
-				{"id": "openai/gpt-5"},
-				{"id": "google/gemini-2.5-flash"},
-				{"id": "deepseek/deepseek-r1"},
-				{"id": "nvidia/nemotron-nano-12b"}
-			]
-		}`
-
-		server := createMockServer(http.StatusOK, mockResponse)
-		defer server.Close()
-
-		// Create config pointing to OpenRouter
-		cfg := &Config{
-			OpenAIBaseURL: "https://openrouter.ai/api/v1",
-		}
-
-		// Temporarily replace the API URL in the function call
-		// Since we can't modify the function, we'll need to test indirectly
-		// by verifying the cache gets populated
-
-		// For this test, we need to manually populate cache as if fetch succeeded
-		// This tests the cache population logic
-		var result struct {
-			Data []struct {
-				ID string `json:"id"`
-			} `json:"data"`
-		}
-		json.Unmarshal([]byte(mockResponse), &result)
-
-		for _, model := range result.Data {
-			reasoningCache.models[model.ID] = true
-		}
-		reasoningCache.populated = true
-
-		// Verify cache was populated
-		if !reasoningCache.populated {
-			t.Error("Expected cache to be populated")
-		}
-
-		if len(reasoningCache.models) != 4 {
-			t.Errorf("Expected 4 models in cache, got %d", len(reasoningCache.models))
-		}
-
-		// Verify specific models are in cache
-		expectedModels := []string{
-			"openai/gpt-5",
-			"google/gemini-2.5-flash",
-			"deepseek/deepseek-r1",
-			"nvidia/nemotron-nano-12b",
-		}
-
-		for _, model := range expectedModels {
-			if !reasoningCache.models[model] {
-				t.Errorf("Expected model %q to be in cache", model)
-			}
-		}
-
-		// Verify cfg.IsReasoningModel works with cached data
-		for _, model := range expectedModels {
-			if !cfg.IsReasoningModel(model) {
-				t.Errorf("Expected IsReasoningModel(%q) to return true", model)
-			}
-		}
-
-		// Cleanup
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-	})
-
-	t.Run("non-OpenRouter provider skips fetch", func(t *testing.T) {
-		// Clear cache
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-
-		tests := []struct {
-			name    string
-			baseURL string
-		}{
-			{"OpenAI Direct", "https://api.openai.com/v1"},
-			{"Ollama", "http://localhost:11434/v1"},
-			{"Unknown", "https://custom.example.com/v1"},
-		}
-
-		for _, tt := range tests {
-			t.Run(tt.name, func(t *testing.T) {
-				cfg := &Config{
-					OpenAIBaseURL: tt.baseURL,
-				}
-
-				// Call FetchReasoningModels - should return early without error
-				err := cfg.FetchReasoningModels()
-				if err != nil {
-					t.Errorf("Expected no error for non-OpenRouter provider, got %v", err)
-				}
-
-				// Cache should still be empty
-				if reasoningCache.populated {
-					t.Error("Expected cache to remain empty for non-OpenRouter provider")
-				}
-			})
-		}
-	})
-
-	t.Run("empty response from API", func(t *testing.T) {
-		// Clear cache
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-
-		// Empty response (no reasoning models available)
-		mockResponse := `{"data": []}`
-
-		// Simulate parsing empty response
-		var result struct {
-			Data []struct {
-				ID string `json:"id"`
-			} `json:"data"`
-		}
-		json.Unmarshal([]byte(mockResponse), &result)
-
-		// Populate cache with empty data
-		for _, model := range result.Data {
-			reasoningCache.models[model.ID] = true
-		}
-		reasoningCache.populated = true
-
-		// Cache should be populated but empty
-		if !reasoningCache.populated {
-			t.Error("Expected cache to be populated even with empty data")
-		}
-
-		if len(reasoningCache.models) != 0 {
-			t.Errorf("Expected 0 models in cache, got %d", len(reasoningCache.models))
-		}
-
-		// Cleanup
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-	})
-
-	t.Run("malformed JSON response", func(t *testing.T) {
-		// Clear cache
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-
-		malformedJSON := `{"data": [{"id": "openai/gpt-5"` // Missing closing braces
-
-		// Attempt to parse malformed JSON
-		var result struct {
-			Data []struct {
-				ID string `json:"id"`
-			} `json:"data"`
-		}
-		err := json.Unmarshal([]byte(malformedJSON), &result)
-
-		// Should get an error
-		if err == nil {
-			t.Error("Expected error when parsing malformed JSON")
-		}
-
-		// Cache should remain unpopulated on error
-		if reasoningCache.populated {
-			t.Error("Expected cache to remain unpopulated after JSON parse error")
-		}
-	})
-
-	t.Run("cache allows fallback to hardcoded patterns", func(t *testing.T) {
-		// Clear cache
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-
-		cfg := &Config{
-			OpenAIBaseURL: "https://openrouter.ai/api/v1",
-		}
-
-		// With empty cache, should fall back to hardcoded patterns
-		hardcodedModels := []string{"o1", "o3", "gpt-5", "gpt-5-mini"}
-
-		for _, model := range hardcodedModels {
-			if !cfg.IsReasoningModel(model) {
-				t.Errorf("Expected IsReasoningModel(%q) to return true via fallback", model)
-			}
-		}
-
-		// Non-reasoning models should still return false
-		nonReasoningModels := []string{"gpt-4o", "gpt-4-turbo", "claude-sonnet-4"}
-
-		for _, model := range nonReasoningModels {
-			if cfg.IsReasoningModel(model) {
-				t.Errorf("Expected IsReasoningModel(%q) to return false", model)
-			}
-		}
-	})
-
-	t.Run("cache overrides hardcoded patterns for OpenRouter", func(t *testing.T) {
-		// Setup cache with a model that wouldn't match hardcoded patterns
-		reasoningCache = &ReasoningModelCache{
-			models: map[string]bool{
-				"google/gemini-2.5-flash": true,
-				"deepseek/deepseek-r1":    true,
-			},
-			populated: true,
-		}
-
-		cfg := &Config{
-			OpenAIBaseURL: "https://openrouter.ai/api/v1",
-		}
-
-		// These models are in cache, should return true
-		if !cfg.IsReasoningModel("google/gemini-2.5-flash") {
-			t.Error("Expected gemini-2.5-flash to be reasoning model (from cache)")
-		}
-
-		if !cfg.IsReasoningModel("deepseek/deepseek-r1") {
-			t.Error("Expected deepseek-r1 to be reasoning model (from cache)")
-		}
-
-		// This model is not in cache, should fall back to hardcoded patterns
-		if !cfg.IsReasoningModel("gpt-5") {
-			t.Error("Expected gpt-5 to be reasoning model (from fallback)")
-		}
-
-		// This model is not in cache and doesn't match patterns
-		if cfg.IsReasoningModel("anthropic/claude-sonnet-4") {
-			t.Error("Expected claude-sonnet-4 to NOT be reasoning model")
-		}
-
-		// Cleanup
-		reasoningCache = &ReasoningModelCache{
-			models:    make(map[string]bool),
-			populated: false,
-		}
-	})
-}
diff --git a/internal/converter/converter.go b/internal/converter/converter.go
index 048c9b6..0b44468 100644
--- a/internal/converter/converter.go
+++ b/internal/converter/converter.go
@@ -137,12 +137,14 @@ func ConvertRequest(claudeReq models.ClaudeRequest, cfg *config.Config) (*models
 		}
 	}
 
-	// Set token limit
+	// Set token limit using adaptive per-model detection
 	if claudeReq.MaxTokens > 0 {
-		// Reasoning models (o1, o3, o4, gpt-5) require max_completion_tokens
-		// instead of the legacy max_tokens parameter.
-		// Uses dynamic detection from OpenRouter API for reasoning models.
-		if cfg.IsReasoningModel(openaiModel) {
+		// Use capability-based detection - NO hardcoded model patterns!
+		// ShouldUseMaxCompletionTokens checks cached per-model capabilities:
+		// - Cache hit: Use learned value (max_completion_tokens or max_tokens)
+		// - Cache miss: Try max_completion_tokens first (will auto-detect via retry)
+		// This works with ANY model/provider without code changes
+		if cfg.ShouldUseMaxCompletionTokens(openaiModel) {
 			openaiReq.MaxCompletionTokens = claudeReq.MaxTokens
 		} else {
 			openaiReq.MaxTokens = claudeReq.MaxTokens
diff --git a/internal/converter/reasoning_model_test.go b/internal/converter/reasoning_model_test.go
index 74f4448..89f617e 100644
--- a/internal/converter/reasoning_model_test.go
+++ b/internal/converter/reasoning_model_test.go
@@ -1,165 +1 @@
 package converter
-
-import (
-	"testing"
-
-	"github.com/claude-code-proxy/proxy/internal/config"
-	"github.com/claude-code-proxy/proxy/pkg/models"
-)
-
-func TestIsReasoningModel(t *testing.T) {
-	// Create a config with OpenAI Direct (uses hardcoded pattern matching)
-	cfg := &config.Config{
-		OpenAIAPIKey:  "test-key",
-		OpenAIBaseURL: "https://api.openai.com/v1",
-	}
-
-	tests := []struct {
-		name     string
-		model    string
-		expected bool
-	}{
-		// GPT-5 series (reasoning models)
-		{"gpt-5", "gpt-5", true},
-		{"gpt-5 uppercase", "GPT-5", true},
-		{"gpt-5-mini", "gpt-5-mini", true},
-		{"gpt-5-turbo", "gpt-5-turbo", true},
-		{"azure/gpt-5", "azure/gpt-5", true},
-		{"openai/gpt-5", "openai/gpt-5", true},
-		{"azure/gpt-5-mini", "azure/gpt-5-mini", true},
-
-		// o-series reasoning models
-		{"o1", "o1", true},
-		{"o1-preview", "o1-preview", true},
-		{"o1-mini", "o1-mini", true},
-		{"o2", "o2", true},
-		{"o2-preview", "o2-preview", true},
-		{"o2-mini", "o2-mini", true},
-		{"o3", "o3", true},
-		{"o3-mini", "o3-mini", true},
-		{"o4", "o4", true},
-		{"o4-turbo", "o4-turbo", true},
-		{"azure/o1", "azure/o1", true},
-		{"azure/o2", "azure/o2", true},
-		{"openai/o3", "openai/o3", true},
-
-		// GPT-4 series (NOT reasoning models)
-		{"gpt-4", "gpt-4", false},
-		{"gpt-4o", "gpt-4o", false},
-		{"gpt-4-turbo", "gpt-4-turbo", false},
-		{"gpt-4.1", "gpt-4.1", false},
-		{"gpt-4o-mini", "gpt-4o-mini", false},
-		{"azure/gpt-4o", "azure/gpt-4o", false},
-		{"openai/gpt-4-turbo", "openai/gpt-4-turbo", false},
-
-		// GPT-3.5 series (NOT reasoning models)
-		{"gpt-3.5-turbo", "gpt-3.5-turbo", false},
-		{"gpt-3.5-turbo-16k", "gpt-3.5-turbo-16k", false},
-
-		// Other models (NOT reasoning models)
-		{"claude-3-opus", "claude-3-opus", false},
-		{"claude-sonnet-4", "claude-sonnet-4", false},
-		{"gemini-pro", "gemini-pro", false},
-		{"llama-3-70b", "llama-3-70b", false},
-
-		// Edge cases
-		{"empty string", "", false},
-		{"o prefix but not reasoning", "ollama", false},
-		{"contains gpt-5 but not start", "meta-gpt-5", false},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := cfg.IsReasoningModel(tt.model)
-			if result != tt.expected {
-				t.Errorf("cfg.IsReasoningModel(%q) = %v, expected %v", tt.model, result, tt.expected)
-			}
-		})
-	}
-}
-
-func TestReasoningModelTokenParameter(t *testing.T) {
-	tests := []struct {
-		name                string
-		model               string
-		maxTokens           int
-		expectMaxTokens     int
-		expectMaxCompletion int
-	}{
-		{
-			name:                "gpt-5 uses max_completion_tokens",
-			model:               "gpt-5",
-			maxTokens:           100,
-			expectMaxTokens:     0,
-			expectMaxCompletion: 100,
-		},
-		{
-			name:                "o1 uses max_completion_tokens",
-			model:               "o1",
-			maxTokens:           200,
-			expectMaxTokens:     0,
-			expectMaxCompletion: 200,
-		},
-		{
-			name:                "o2 uses max_completion_tokens",
-			model:               "o2",
-			maxTokens:           150,
-			expectMaxTokens:     0,
-			expectMaxCompletion: 150,
-		},
-		{
-			name:                "azure/o3 uses max_completion_tokens",
-			model:               "azure/o3",
-			maxTokens:           150,
-			expectMaxTokens:     0,
-			expectMaxCompletion: 150,
-		},
-		{
-			name:                "gpt-4o uses max_tokens",
-			model:               "gpt-4o",
-			maxTokens:           100,
-			expectMaxTokens:     100,
-			expectMaxCompletion: 0,
-		},
-		{
-			name:                "gpt-4-turbo uses max_tokens",
-			model:               "gpt-4-turbo",
-			maxTokens:           200,
-			expectMaxTokens:     200,
-			expectMaxCompletion: 0,
-		},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			// Create a minimal Claude request
-			claudeReq := models.ClaudeRequest{
-				Model:     tt.model,
-				MaxTokens: tt.maxTokens,
-				Messages: []models.ClaudeMessage{
-					{Role: "user", Content: "test"},
-				},
-			}
-
-			// Create a minimal config
-			cfg := &config.Config{
-				OpenAIAPIKey:  "test-key",
-				OpenAIBaseURL: "https://api.openai.com/v1",
-			}
-
-			// Convert the request
-			openaiReq, err := ConvertRequest(claudeReq, cfg)
-			if err != nil {
-				t.Fatalf("ConvertRequest failed: %v", err)
-			}
-
-			// Verify token parameters
-			if openaiReq.MaxTokens != tt.expectMaxTokens {
-				t.Errorf("MaxTokens = %d, expected %d", openaiReq.MaxTokens, tt.expectMaxTokens)
-			}
-			if openaiReq.MaxCompletionTokens != tt.expectMaxCompletion {
-				t.Errorf("MaxCompletionTokens = %d, expected %d", openaiReq.MaxCompletionTokens, tt.expectMaxCompletion)
-			}
-		})
-	}
-}
diff --git a/internal/server/handlers.go b/internal/server/handlers.go
index abe80ab..f3b8f8c 100644
--- a/internal/server/handlers.go
+++ b/internal/server/handlers.go
@@ -196,73 +196,23 @@ func handleStreamingMessages(c *fiber.Ctx, openaiReq *models.OpenAIRequest, cfg
 			fmt.Printf("[DEBUG] StreamWriter: Starting\n")
 		}
 
-		// Marshal request
-		reqBody, err := json.Marshal(openaiReq)
-		if err != nil {
-			if cfg.Debug {
-				fmt.Printf("[DEBUG] StreamWriter: Failed to marshal: %v\n", err)
-			}
-			writeSSEError(w, fmt.Sprintf("failed to marshal request: %v", err))
-			return
-		}
-
 		if cfg.Debug {
-			fmt.Printf("[DEBUG] StreamWriter: Making request to %s\n", cfg.OpenAIBaseURL+"/chat/completions")
-		}
-
-		// Build API URL
-		apiURL := cfg.OpenAIBaseURL + "/chat/completions"
-
-		// Create HTTP request
-		httpReq, err := http.NewRequest("POST", apiURL, bytes.NewBuffer(reqBody))
-		if err != nil {
-			writeSSEError(w, fmt.Sprintf("failed to create request: %v", err))
-			return
-		}
-
-		// Set headers
-		httpReq.Header.Set("Content-Type", "application/json")
-
-		// Skip auth for Ollama (localhost) - Ollama doesn't require authentication
-		if !cfg.IsLocalhost() {
-			httpReq.Header.Set("Authorization", "Bearer "+cfg.OpenAIAPIKey)
+			fmt.Printf("[DEBUG] StreamWriter: Making streaming request to %s\n", cfg.OpenAIBaseURL+"/chat/completions")
 		}
 
-		// OpenRouter-specific headers for better rate limits
-		if cfg.DetectProvider() == config.ProviderOpenRouter {
-			addOpenRouterHeaders(httpReq, cfg)
-		}
-
-		client := &http.Client{
-			Timeout: 300 * time.Second, // Longer timeout for streaming
-		}
-
-		// Make request
-		resp, err := client.Do(httpReq)
+		// Make streaming request with automatic retry logic
+		resp, err := callOpenAIStream(openaiReq, cfg)
 		if err != nil {
 			if cfg.Debug {
 				fmt.Printf("[DEBUG] StreamWriter: Request failed: %v\n", err)
 			}
-			writeSSEError(w, fmt.Sprintf("request failed: %v", err))
+			writeSSEError(w, fmt.Sprintf("streaming request failed: %v", err))
 			return
 		}
 		defer func() { _ = resp.Body.Close() }()
 
 		if cfg.Debug {
-			fmt.Printf("[DEBUG] StreamWriter: Got response with status %d\n", resp.StatusCode)
-		}
-
-		if resp.StatusCode != http.StatusOK {
-			body, _ := io.ReadAll(resp.Body)
-			if cfg.Debug {
-				fmt.Printf("[DEBUG] StreamWriter: Bad status: %s\n", string(body))
-			}
-			writeSSEError(w, fmt.Sprintf("OpenAI API returned status %d: %s", resp.StatusCode, string(body)))
-			return
-		}
-
-		if cfg.Debug {
-			fmt.Printf("[DEBUG] StreamWriter: Starting streamOpenAIToClaude conversion\n")
+			fmt.Printf("[DEBUG] StreamWriter: Got response, starting streamOpenAIToClaude conversion\n")
 		}
 
 		// Stream conversion
@@ -826,8 +776,187 @@ func writeSSEError(w *bufio.Writer, message string) {
 	_ = w.Flush()
 }
 
-// callOpenAI makes an HTTP request to the OpenAI API
+// callOpenAI makes an HTTP request to the OpenAI API with automatic retry logic
+// for max_completion_tokens parameter errors. Uses per-model capability caching.
 func callOpenAI(req *models.OpenAIRequest, cfg *config.Config) (*models.OpenAIResponse, error) {
+	// Try the request with the configured parameters
+	resp, err := callOpenAIInternal(req, cfg)
+	if err != nil {
+		// Check if this is a max_tokens parameter error
+		if isMaxTokensParameterError(err.Error()) {
+			if cfg.Debug {
+				fmt.Printf("[DEBUG] Detected max_completion_tokens parameter error for model %s, retrying without it\n", req.Model)
+			}
+			// Retry without max_completion_tokens and cache the capability per model
+			return retryWithoutMaxCompletionTokens(req, cfg)
+		}
+		// Other errors - return as-is
+		return nil, err
+	}
+
+	// Success on first try - cache that this (provider, model) supports max_completion_tokens
+	// Only cache if we actually sent max_completion_tokens
+	if req.MaxCompletionTokens > 0 {
+		cacheKey := config.CacheKey{
+			BaseURL: cfg.OpenAIBaseURL,
+			Model:   req.Model,
+		}
+		config.SetModelCapabilities(cacheKey, &config.ModelCapabilities{
+			UsesMaxCompletionTokens: true,
+		})
+		if cfg.Debug {
+			fmt.Printf("[DEBUG] Cached: model %s supports max_completion_tokens\n", req.Model)
+		}
+	}
+
+	return resp, nil
+}
+
+// callOpenAIStream makes a streaming HTTP request with retry logic for parameter errors.
+// Uses per-model capability caching.
+func callOpenAIStream(req *models.OpenAIRequest, cfg *config.Config) (*http.Response, error) {
+	// Try with configured parameters
+	resp, err := callOpenAIStreamInternal(req, cfg)
+	if err != nil {
+		// Check if this is a max_tokens parameter error
+		if isMaxTokensParameterError(err.Error()) {
+			if cfg.Debug {
+				fmt.Printf("[DEBUG] Detected max_completion_tokens parameter error in stream for model %s, retrying without it\n", req.Model)
+			}
+			// Create retry request without max tokens
+			retryReq := *req
+			retryReq.MaxCompletionTokens = 0
+			retryReq.MaxTokens = 0
+
+			// Cache that this (provider, model) doesn't support max_completion_tokens
+			cacheKey := config.CacheKey{
+				BaseURL: cfg.OpenAIBaseURL,
+				Model:   req.Model,
+			}
+			config.SetModelCapabilities(cacheKey, &config.ModelCapabilities{
+				UsesMaxCompletionTokens: false,
+			})
+
+			return callOpenAIStreamInternal(&retryReq, cfg)
+		}
+		return nil, err
+	}
+
+	// Success - cache capability if we sent max_completion_tokens
+	if req.MaxCompletionTokens > 0 {
+		cacheKey := config.CacheKey{
+			BaseURL: cfg.OpenAIBaseURL,
+			Model:   req.Model,
+		}
+		config.SetModelCapabilities(cacheKey, &config.ModelCapabilities{
+			UsesMaxCompletionTokens: true,
+		})
+		if cfg.Debug {
+			fmt.Printf("[DEBUG] Cached: model %s supports max_completion_tokens (streaming)\n", req.Model)
+		}
+	}
+
+	return resp, nil
+}
+
+// callOpenAIStreamInternal makes a streaming HTTP request without retry logic
+func callOpenAIStreamInternal(req *models.OpenAIRequest, cfg *config.Config) (*http.Response, error) {
+	// Marshal request to JSON
+	reqBody, err := json.Marshal(req)
+	if err != nil {
+		return nil, fmt.Errorf("failed to marshal request: %w", err)
+	}
+
+	// Build API URL
+	apiURL := cfg.OpenAIBaseURL + "/chat/completions"
+
+	// Create HTTP request
+	httpReq, err := http.NewRequest("POST", apiURL, bytes.NewBuffer(reqBody))
+	if err != nil {
+		return nil, fmt.Errorf("failed to create request: %w", err)
+	}
+
+	// Set headers
+	httpReq.Header.Set("Content-Type", "application/json")
+
+	// Skip auth for Ollama (localhost)
+	if !cfg.IsLocalhost() {
+		httpReq.Header.Set("Authorization", "Bearer "+cfg.OpenAIAPIKey)
+	}
+
+	// OpenRouter-specific headers
+	if cfg.DetectProvider() == config.ProviderOpenRouter {
+		addOpenRouterHeaders(httpReq, cfg)
+	}
+
+	// Create HTTP client with longer timeout for streaming
+	client := &http.Client{
+		Timeout: 300 * time.Second,
+	}
+
+	// Make request
+	resp, err := client.Do(httpReq)
+	if err != nil {
+		return nil, fmt.Errorf("request failed: %w", err)
+	}
+
+	// Check for errors
+	if resp.StatusCode != http.StatusOK {
+		body, _ := io.ReadAll(resp.Body)
+		_ = resp.Body.Close()
+		return nil, fmt.Errorf("OpenAI API returned status %d: %s", resp.StatusCode, string(body))
+	}
+
+	return resp, nil
+}
+
+// isMaxTokensParameterError checks if the error message indicates an unsupported
+// max_tokens or max_completion_tokens parameter issue.
+// Uses broad keyword matching to handle different error message formats across providers.
+// No status code checking - relies on message content alone.
+func isMaxTokensParameterError(errorMessage string) bool {
+	errorLower := strings.ToLower(errorMessage)
+
+	// Check for parameter error indicators
+	hasParamIndicator := strings.Contains(errorLower, "parameter") ||
+		strings.Contains(errorLower, "unsupported") ||
+		strings.Contains(errorLower, "invalid")
+
+	// Check for our specific parameter names
+	hasOurParam := strings.Contains(errorLower, "max_tokens") ||
+		strings.Contains(errorLower, "max_completion_tokens")
+
+	// Require both indicators to reduce false positives
+	return hasParamIndicator && hasOurParam
+}
+
+// retryWithoutMaxCompletionTokens attempts the request again without max_completion_tokens.
+// Caches the result per (provider, model) combination for future requests.
+func retryWithoutMaxCompletionTokens(req *models.OpenAIRequest, cfg *config.Config) (*models.OpenAIResponse, error) {
+	// Create a copy of the request without max_completion_tokens
+	retryReq := *req
+	retryReq.MaxCompletionTokens = 0
+	retryReq.MaxTokens = 0 // Also clear max_tokens to avoid issues
+
+	if cfg.Debug {
+		fmt.Printf("[DEBUG] Retrying without max_completion_tokens/max_tokens for model: %s\n", req.Model)
+	}
+
+	// Cache that this specific (provider, model) doesn't support max_completion_tokens
+	cacheKey := config.CacheKey{
+		BaseURL: cfg.OpenAIBaseURL,
+		Model:   req.Model,
+	}
+	config.SetModelCapabilities(cacheKey, &config.ModelCapabilities{
+		UsesMaxCompletionTokens: false,
+	})
+
+	// Make the retry request
+	return callOpenAIInternal(&retryReq, cfg)
+}
+
+// callOpenAIInternal is the internal implementation without retry logic
+func callOpenAIInternal(req *models.OpenAIRequest, cfg *config.Config) (*models.OpenAIResponse, error) {
 	// Marshal request to JSON
 	reqBody, err := json.Marshal(req)
 	if err != nil {