LLM Providers

Core files: assistant/multillm.go (initialization), assistant/assutils.go (model list, constants)

The assistant framework supports 11 LLM providers plus a direct Perplexity search client. All providers implement the LLMClient interface and are registered in InitLLMClients().

Provider Registry

#	Provider Key	Client File	Timeout	Context	Notes
1	`ollama`	`ollama_client.go`	45 min	Varies	Legacy local models. Shares semaphore with llama.
2	`llama`	`llama_client.go`	45 min	Model-specific	llama.cpp server at AI.dLAN:11434. Shares Ollama semaphore.
3	`google`	`gemini_client.go`	3 min	Model-specific	Gemini 2.5-pro, 2.5-flash, etc.
4	`anthropic`	`anthropic_client.go`	45 min	Model-specific	Claude 3.5 Sonnet, 3 Opus, etc.
5	`grok`	`grok_client.go`	5 min	~128k	grok-3, grok-4.3, grok-4-0709, grok-4.1-fast
6	`kimi`	`kimi_k2.go`	2 min	~200k	kimi-k2, kimi-k2-thinking
7	`deepseek`	`deepseek_client.go`	2 min	~64k	deepseek-chat, deepseek-coder
8	`openrouter`	`openrouter_client.go`	30 min	Dynamic	Gateway to 90+ models. Special features below.
9	`inception`	`inception_client.go`	10 min	128k	Inception Labs API
10	`zai`	`zai_client.go`	30 min	128k	ZAI.ai API. Only initialized if `ZAI_API_KEY` env var set.
11	`nvidia`	`nvidia_client.go`	45 min	Varies	NVIDIA NIM (build.nvidia.com) OpenAI-compatible API.

Perplexity (perplexity_client.go) is NOT in the llmClients map. It's instantiated directly in search_tools.go for web search queries.
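
The registry described above can be pictured as a map from provider key to client. The following is a minimal sketch, not the code in multillm.go: the LLMClient method set, the stubClient type, the lowercase initLLMClients name, and the injected getenv parameter are all assumptions made for illustration. Only the provider keys, the timeouts, and the ZAI_API_KEY condition come from the table.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"sort"
	"time"
)

// LLMClient is a hypothetical stand-in for the framework's interface; the
// real method set may differ.
type LLMClient interface {
	Complete(ctx context.Context, model, prompt string) (string, error)
	Timeout() time.Duration
}

// stubClient is a placeholder implementation used only in this sketch.
type stubClient struct {
	name    string
	timeout time.Duration
}

func (c stubClient) Complete(ctx context.Context, model, prompt string) (string, error) {
	return "", fmt.Errorf("%s: not implemented in this sketch", c.name)
}

func (c stubClient) Timeout() time.Duration { return c.timeout }

// initLLMClients builds the provider-key -> client map. getenv is injected
// (an assumption) so the conditional zai registration can be exercised
// without touching the process environment. Perplexity is deliberately
// absent: the text says it is created directly in search_tools.go.
func initLLMClients(getenv func(string) string) map[string]LLMClient {
	clients := map[string]LLMClient{
		"ollama":     stubClient{name: "ollama", timeout: 45 * time.Minute},
		"llama":      stubClient{name: "llama", timeout: 45 * time.Minute},
		"google":     stubClient{name: "google", timeout: 3 * time.Minute},
		"anthropic":  stubClient{name: "anthropic", timeout: 45 * time.Minute},
		"grok":       stubClient{name: "grok", timeout: 5 * time.Minute},
		"kimi":       stubClient{name: "kimi", timeout: 2 * time.Minute},
		"deepseek":   stubClient{name: "deepseek", timeout: 2 * time.Minute},
		"openrouter": stubClient{name: "openrouter", timeout: 30 * time.Minute},
		"inception":  stubClient{name: "inception", timeout: 10 * time.Minute},
		"nvidia":     stubClient{name: "nvidia", timeout: 45 * time.Minute},
	}
	// zai is registered only when an API key is present.
	if getenv("ZAI_API_KEY") != "" {
		clients["zai"] = stubClient{name: "zai", timeout: 30 * time.Minute}
	}
	return clients
}

func main() {
	clients := initLLMClients(os.Getenv)
	keys := make([]string, 0, len(clients))
	for k := range clients {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%-10s timeout=%s\n", k, clients[k].Timeout())
	}
}
```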

Model Identifier Format

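All models use provider/model-name format (e.g., llama/qwen3:30b, openrouter/z-ai/glm-5, nvidia/z-ai/glm-5). The provider key is everything before the first slash; the remainder, which may itself contain slashes, is the model name passed to that provider.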

The llama/ prefix has replaced ollama/ for local models: everything previously addressed as ollama/* is now served by the llama.cpp server and addressed as llama/*.
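
To illustrate the identifier format, here is a small sketch that splits an identifier into provider key and model name. splitModelID is a hypothetical helper, not a function from the codebase. It assumes the split happens at the first slash only, which is what identifiers such as openrouter/z-ai/glm-5 require.

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelID splits "provider/model-name" at the first slash. Everything
// after that slash is kept intact, so nested names such as "z-ai/glm-5"
// survive. Hypothetical helper for illustration only.
func splitModelID(id string) (string, string, error) {
	provider, model, ok := strings.Cut(id, "/")
	if !ok || provider == "" || model == "" {
		return "", "", fmt.Errorf("invalid model identifier %q: want provider/model-name", id)
	}
	return provider, model, nil
}

func main() {
	for _, id := range []string{"llama/qwen3:30b", "openrouter/z-ai/glm-5", "no-provider"} {
		provider, model, err := splitModelID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("provider=%s model=%s\n", provider, model)
	}
}
```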

Model Constants

Constant	Value	Location	Purpose
`DefaultModel`	`llama/llama3.2:latest`	assutils.go:30	Fallback when no model specified
`DefaultHighEndModel`	`nvidia/z-ai/glm-5`	assutils.go:31	System-wide high-intelligence model
`FallbackFreeModel`	`nvidia/z-ai/glm-5`	assutils.go:32	Free tier fallback
`FallbackFlashModel`	`llama/qwen3:8b`	assutils.go:33	Flash model fallback
`SummaryModel`	`openrouter/z-ai/glm-4.5-air:free`	assutils.go:34	Conversation summarization
`TranslationModel`	`openrouter/z-ai/glm-4.5-air:free`	assutils.go:35	Translation tasks
`RefinementModel`	`openrouter/z-ai/glm-4.5-air:free`	assutils.go:36	Response refinement
`STTModel`	`google/gemini-2.5-flash`	assutils.go:37	Speech-to-text
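
As a rough sketch of how these constants might be declared and how DefaultModel could act as the fallback, consider the following. The values are copied from the table. The resolveModel helper is hypothetical and only illustrates the "fallback when no model specified" behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Values copied from the table above.
const (
	DefaultModel        = "llama/llama3.2:latest"
	DefaultHighEndModel = "nvidia/z-ai/glm-5"
	FallbackFreeModel   = "nvidia/z-ai/glm-5"
	FallbackFlashModel  = "llama/qwen3:8b"
	SummaryModel        = "openrouter/z-ai/glm-4.5-air:free"
	TranslationModel    = "openrouter/z-ai/glm-4.5-air:free"
	RefinementModel     = "openrouter/z-ai/glm-4.5-air:free"
	STTModel            = "google/gemini-2.5-flash"
)

// resolveModel is a hypothetical helper: it returns DefaultModel when the
// caller did not specify a model, and the requested model otherwise.
func resolveModel(requested string) string {
	if strings.TrimSpace(requested) == "" {
		return DefaultModel
	}
	return requested
}

func main() {
	fmt.Println(resolveModel(""))
	fmt.Println(resolveModel(DefaultHighEndModel))
}
```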

Context Overflow Fallback (May 18, 2026)

File: assistant/llama_client.go — ErrContextOverflow sentinel

When llama-server returns a 400 error indicating context length exceeded, the ReAct loop and Single-Shot paths now detect this and trigger a fallback chain:

Current model fails with context overflow
→ Fall back to gemini-2.5-flash-lite
Circuit breaker logic at the 400/402/429 level handles all error types uniformly

This was added in commit 41536d47 to prevent model deadlocks when long conversations exhaust context windows.
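
The detection and fallback can be sketched with a sentinel error and errors.Is. This is an illustration under stated assumptions, not the code from commit 41536d47: the substring check on the error body, the google/ prefix on the fallback model, and the helper names classifyLlamaError, isContextOverflow, and completeWithOverflowFallback are all hypothetical. The circuit-breaker logic is not modelled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrContextOverflow mirrors the sentinel named in the text.
var ErrContextOverflow = errors.New("llama: context length exceeded")

// overflowFallbackModel assumes the google/ provider prefix; the text gives
// only the bare model name gemini-2.5-flash-lite.
const overflowFallbackModel = "google/gemini-2.5-flash-lite"

// classifyLlamaError turns a llama-server response into an error. A 400 whose
// body mentions "context" is wrapped as ErrContextOverflow. The substring
// match is an assumption about how the server words the error.
func classifyLlamaError(status int, body string) error {
	if status < 400 {
		return nil
	}
	if status == 400 && strings.Contains(strings.ToLower(body), "context") {
		return fmt.Errorf("%w: %s", ErrContextOverflow, body)
	}
	return fmt.Errorf("llama-server returned HTTP %d: %s", status, body)
}

// isContextOverflow reports whether err wraps the sentinel.
func isContextOverflow(err error) bool {
	return errors.Is(err, ErrContextOverflow)
}

// completeWithOverflowFallback calls the requested model and, only when that
// call fails with a context overflow, retries once on the fallback model.
func completeWithOverflowFallback(model string, call func(model string) (string, error)) (string, error) {
	out, err := call(model)
	if err == nil || !isContextOverflow(err) || model == overflowFallbackModel {
		return out, err
	}
	return call(overflowFallbackModel)
}

func main() {
	call := func(model string) (string, error) {
		if strings.HasPrefix(model, "llama/") {
			return "", classifyLlamaError(400, "request exceeds the available context size")
		}
		return "answered by " + model, nil
	}
	out, err := completeWithOverflowFallback("llama/qwen3:30b", call)
	fmt.Println(out, err)
}
```

Wrapping with %w keeps the original server message available for logs while still letting both the ReAct and Single-Shot paths test for the sentinel in one place.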

OpenRouter Special Features

The OpenRouter client has several features not present in other providers:

Request Jitter: 100-400ms random delay to prevent thundering-herd 429 errors
Free Model Detection: isFreeModel() identifies free-tier models and logs warnings when used with tools
Tool-Aware Fallback: GetFreeModelsFallbackChainToolAware() filters out "flash" models when tools are present (flash models fail tool calls)
400 Error Retry: If a 400 error mentions tool_choice, retries without tool_choice
429 Handling: Reads the Retry-After header and applies progressive backoff (10s, 30s, 60s, plus jitter)
Cache Metrics: Tracks prompt_tokens_details.cache_hit and cache_miss from OpenRouter responses
Cost Tracking: Extracts usage.cost from response JSON
Prompt Caching: OpenRouter prompt caching enabled (May 9, 2026) — reduces token costs for long conversations
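
Two of the behaviours above, request jitter and 429 backoff, can be sketched as follows. This is not the OpenRouter client's code. It assumes that a numeric Retry-After value takes precedence over the progressive schedule and that only the delay-in-seconds form of the header is handled. The function names requestJitter, retryDelay, and newRand are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
	"time"
)

// backoffSteps is the progressive schedule given in the text.
var backoffSteps = []time.Duration{10 * time.Second, 30 * time.Second, 60 * time.Second}

// newRand returns a seeded generator so callers can make delays reproducible.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// requestJitter returns a random delay in the range [100ms, 400ms].
func requestJitter(r *rand.Rand) time.Duration {
	return 100*time.Millisecond + time.Duration(r.Int63n(int64(300*time.Millisecond)+1))
}

// retryDelay picks the wait before retrying a 429. retryAfter is the raw
// Retry-After header value. Assumption: a valid seconds value wins over the
// schedule; otherwise the step for this attempt plus jitter is used, and
// attempts past the last step stay at the last step.
func retryDelay(retryAfter string, attempt int, r *rand.Rand) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= len(backoffSteps) {
		attempt = len(backoffSteps) - 1
	}
	return backoffSteps[attempt] + requestJitter(r)
}

func main() {
	r := newRand(time.Now().UnixNano())
	fmt.Println("pre-request jitter:", requestJitter(r))
	fmt.Println("429 with Retry-After=7:", retryDelay("7", 0, r))
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Printf("429 attempt %d, no header: %s\n", attempt, retryDelay("", attempt, r))
	}
}
```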

Free Models System

File: assistant/free-models.go (391 lines)

The FreeModelsProvider manages a JSON config (free-models.json) that controls model selection for different scenarios:

Function	Purpose
`GetInterimFreeModel()`	Best model for interim/streaming responses
`GetGreetingFreeModel()`	Best model for greeting messages
`GetFreeModelsFallbackChain()`	Priority-sorted fallback chain (local models first)
`GetFreeModelsFallbackChainToolAware(hasTools)`	Fallback chain excluding "flash" models when tools needed

CRUD Operations: AddModel(), UpdateModel(), RemoveModel(), ReorderModels() — all thread-safe with sync.RWMutex.
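
The tool-aware chain and the locking pattern can be sketched as below. This is a simplified illustration rather than the contents of free-models.go. The FreeModel fields, the meaning of Priority (lower sorts first), the llama/ prefix test for "local", and the sample entries are assumptions, and the JSON loading is omitted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// FreeModel is a hypothetical entry shape; the real free-models.json schema
// may differ.
type FreeModel struct {
	ID       string
	Priority int // assumption: lower values are tried first
}

// FreeModelsProvider guards its model list with a sync.RWMutex, as the text
// describes for the CRUD operations.
type FreeModelsProvider struct {
	mu     sync.RWMutex
	models []FreeModel
}

// AddModel appends an entry under the write lock.
func (p *FreeModelsProvider) AddModel(m FreeModel) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.models = append(p.models, m)
}

// isLocal treats llama/ identifiers as local models (assumption).
func isLocal(id string) bool { return strings.HasPrefix(id, "llama/") }

// GetFreeModelsFallbackChainToolAware returns model IDs with local models
// first, then by priority. When hasTools is true, any ID containing "flash"
// is dropped.
func (p *FreeModelsProvider) GetFreeModelsFallbackChainToolAware(hasTools bool) []string {
	// Copy under the read lock so sorting never mutates shared state.
	p.mu.RLock()
	snapshot := append([]FreeModel(nil), p.models...)
	p.mu.RUnlock()

	sort.SliceStable(snapshot, func(i, j int) bool {
		li, lj := isLocal(snapshot[i].ID), isLocal(snapshot[j].ID)
		if li != lj {
			return li
		}
		return snapshot[i].Priority < snapshot[j].Priority
	})

	chain := make([]string, 0, len(snapshot))
	for _, m := range snapshot {
		if hasTools && strings.Contains(strings.ToLower(m.ID), "flash") {
			continue
		}
		chain = append(chain, m.ID)
	}
	return chain
}

func main() {
	p := &FreeModelsProvider{}
	// Sample entries for illustration only.
	p.AddModel(FreeModel{ID: "openrouter/z-ai/glm-4.5-air:free", Priority: 1})
	p.AddModel(FreeModel{ID: "google/gemini-2.5-flash", Priority: 2})
	p.AddModel(FreeModel{ID: "llama/qwen3:8b", Priority: 3})
	fmt.Println("no tools:", p.GetFreeModelsFallbackChainToolAware(false))
	fmt.Println("tools:   ", p.GetFreeModelsFallbackChainToolAware(true))
}
```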

Model Caching

GetAvailableModels() caches the model list for 3 hours (modelCacheTTL). On first call, it seeds synchronously from the static availableModels slice, then refreshes in the background by querying Ollama, llama-server, OpenRouter, Google, and Anthropic endpoints.
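
The seed-then-refresh pattern can be sketched as follows. This is an illustration, not the framework's implementation: the modelCache type, its injected fetch function (standing in for the provider endpoint queries), the wait helper, and the choice to refresh in the background once the TTL has elapsed are assumptions. The two static entries are placeholders for the real availableModels slice.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// modelCacheTTL matches the 3-hour lifetime given in the text.
const modelCacheTTL = 3 * time.Hour

// availableModels stands in for the static seed list (placeholder entries).
var availableModels = []string{"llama/llama3.2:latest", "llama/qwen3:8b"}

// modelCache is a hypothetical holder for the cached list.
type modelCache struct {
	mu         sync.Mutex
	wg         sync.WaitGroup
	models     []string
	fetchedAt  time.Time
	refreshing bool
	fetch      func() []string // stands in for the provider endpoint queries
}

func newModelCache(fetch func() []string) *modelCache {
	return &modelCache{fetch: fetch}
}

// GetAvailableModels seeds synchronously from the static list on first call
// and returns immediately; a refresh runs in the background on first call
// and whenever the cached list is older than modelCacheTTL.
func (c *modelCache) GetAvailableModels() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.models == nil {
		c.models = append([]string(nil), availableModels...)
		c.startRefreshLocked()
	} else if time.Since(c.fetchedAt) > modelCacheTTL {
		c.startRefreshLocked()
	}
	return append([]string(nil), c.models...)
}

// startRefreshLocked launches at most one refresh; c.mu must be held.
func (c *modelCache) startRefreshLocked() {
	if c.refreshing {
		return
	}
	c.refreshing = true
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		fresh := c.fetch() // runs without the lock so readers are not blocked
		c.mu.Lock()
		defer c.mu.Unlock()
		if len(fresh) > 0 {
			c.models = fresh
		}
		c.fetchedAt = time.Now()
		c.refreshing = false
	}()
}

// wait blocks until any in-flight refresh finishes (useful in tests).
func (c *modelCache) wait() { c.wg.Wait() }

func main() {
	c := newModelCache(func() []string {
		return []string{"openrouter/z-ai/glm-5", "llama/qwen3:30b"}
	})
	fmt.Println("first call (static seed):", c.GetAvailableModels())
	c.wait()
	fmt.Println("after background refresh:", c.GetAvailableModels())
}
```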

See also: Search Tools, Assistant Framework