LLM Providers

Last updated: May 18, 2026

Core files: assistant/multillm.go (initialization), assistant/assutils.go (model list, constants)

The assistant framework supports 11 LLM providers plus a direct Perplexity search client. All providers implement the LLMClient interface and are registered in InitLLMClients().

Provider Registry

| # | Provider Key | Client File | Timeout | Context | Notes |
|---|---|---|---|---|---|
| 1 | ollama | ollama_client.go | 45 min | Varies | Legacy local models. Shares semaphore with llama. |
| 2 | llama | llama_client.go | 45 min | Model-specific | llama.cpp server at AI.dLAN:11434. Shares Ollama semaphore. |
| 3 | google | gemini_client.go | 3 min | Model-specific | Gemini 2.5-pro, 2.5-flash, etc. |
| 4 | anthropic | anthropic_client.go | 45 min | Model-specific | Claude 3.5 Sonnet, 3 Opus, etc. |
| 5 | grok | grok_client.go | 5 min | ~128k | grok-3, grok-4.3, grok-4-0709, grok-4.1-fast |
| 6 | kimi | kimi_k2.go | 2 min | ~200k | kimi-k2, kimi-k2-thinking |
| 7 | deepseek | deepseek_client.go | 2 min | ~64k | deepseek-chat, deepseek-coder |
| 8 | openrouter | openrouter_client.go | 30 min | Dynamic | Gateway to 90+ models. Special features below. |
| 9 | inception | inception_client.go | 10 min | 128k | Inception Labs API |
| 10 | zai | zai_client.go | 30 min | 128k | ZAI.ai API. Only initialized if ZAI_API_KEY env var set. |
| 11 | nvidia | nvidia_client.go | 45 min | Varies | NVIDIA NIM (build.nvidia.com) OpenAI-compatible API. |

Perplexity (perplexity_client.go) is NOT in the llmClients map. It's instantiated directly in search_tools.go for web search queries.
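
The registration rules above can be sketched roughly as follows. This is a minimal illustration, not the code in multillm.go: the LLMClient method set, the stubClient type, and the initLLMClients signature are hypothetical. Only the provider keys, the ZAI_API_KEY condition, and the absence of Perplexity come from this page. Other providers may have their own initialization conditions that are not documented here.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// LLMClient is a hypothetical stand-in for the real interface; the actual
// method set is not shown on this page.
type LLMClient interface {
	Provider() string
}

// stubClient is an illustrative placeholder for a concrete provider client.
type stubClient struct{ key string }

func (c stubClient) Provider() string { return c.key }

// initLLMClients builds the provider map. getenv is injected so the
// conditional zai registration can be exercised without touching the
// real environment.
func initLLMClients(getenv func(string) string) map[string]LLMClient {
	clients := make(map[string]LLMClient)
	// Assumption: these providers are registered unconditionally.
	always := []string{
		"ollama", "llama", "google", "anthropic", "grok",
		"kimi", "deepseek", "openrouter", "inception", "nvidia",
	}
	for _, key := range always {
		clients[key] = stubClient{key: key}
	}
	// zai is only registered when its API key is present.
	if getenv("ZAI_API_KEY") != "" {
		clients["zai"] = stubClient{key: "zai"}
	}
	// Perplexity is deliberately absent: it is created in search_tools.go.
	return clients
}

func main() {
	clients := initLLMClients(os.Getenv)
	keys := make([]string, 0, len(clients))
	for k := range clients {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(len(keys), keys)
}
```

Injecting getenv keeps the sketch testable; a real initializer would more likely read os.Getenv directly.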

Model Identifier Format

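All models use provider/model-name format (e.g., llama/qwen3:30b, openrouter/z-ai/glm-5, nvidia/nvidia/zai/glm-5). The provider key is the segment before the first slash; the model name that follows may itself contain slashes.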

The llama/ prefix has replaced ollama/ for local models: all locally hosted models (previously ollama/*) are now served by the llama.cpp server.
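
As an illustration of the identifier format, the sketch below splits an identifier at the first slash. The helper name splitModelID is hypothetical, and the framework's own parsing may differ. The sample identifiers are the ones quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelID separates "provider/model-name" at the first slash.
// The model part may itself contain slashes (e.g. "z-ai/glm-5").
// ok is false when either part is missing.
func splitModelID(id string) (string, string, bool) {
	provider, model, found := strings.Cut(id, "/")
	if !found || provider == "" || model == "" {
		return "", "", false
	}
	return provider, model, true
}

func main() {
	for _, id := range []string{
		"llama/qwen3:30b",
		"openrouter/z-ai/glm-5",
		"nvidia/nvidia/zai/glm-5",
	} {
		provider, model, ok := splitModelID(id)
		fmt.Printf("%-26s provider=%q model=%q ok=%v\n", id, provider, model, ok)
	}
}
```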

Model Constants

| Constant | Value | Location | Purpose |
|---|---|---|---|
| DefaultModel | llama/llama3.2:latest | assutils.go:30 | Fallback when no model specified |
| DefaultHighEndModel | nvidia/z-ai/glm-5 | assutils.go:31 | System-wide high-intelligence model |
| FallbackFreeModel | nvidia/z-ai/glm-5 | assutils.go:32 | Free tier fallback |
| FallbackFlashModel | llama/qwen3:8b | assutils.go:33 | Flash model fallback |
| SummaryModel | openrouter/z-ai/glm-4.5-air:free | assutils.go:34 | Conversation summarization |
| TranslationModel | openrouter/z-ai/glm-4.5-air:free | assutils.go:35 | Translation tasks |
| RefinementModel | openrouter/z-ai/glm-4.5-air:free | assutils.go:36 | Response refinement |
| STTModel | google/gemini-2.5-flash | assutils.go:37 | Speech-to-text |

Context Overflow Fallback (May 18, 2026)

File: assistant/llama_client.go — ErrContextOverflow sentinel

When llama-server returns a 400 error indicating that the context length was exceeded, the ReAct loop and Single-Shot paths detect it and trigger a fallback chain:

  1. The current model fails with a context overflow.
  2. The request falls back to gemini-2.5-flash-lite.
  3. Circuit breaker logic at the 400/402/429 level handles all error types uniformly.

This was added in commit 41536d47 to prevent model deadlocks when long conversations exhaust context windows.
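
A sentinel-based fallback of this kind might look like the sketch below. Only the ErrContextOverflow name and the gemini-2.5-flash-lite target come from this page. The classify400 and generateWithFallback helpers, the matched error substring, and the google/ prefix on the fallback model are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrContextOverflow mirrors the sentinel named in llama_client.go; its
// real definition and message are not shown on this page.
var ErrContextOverflow = errors.New("context length exceeded")

// fallbackModel is the fallback target; the "google/" prefix is assumed.
const fallbackModel = "google/gemini-2.5-flash-lite"

// classify400 is a hypothetical helper: it wraps the sentinel when a 400
// response body mentions the context. The matched substring is an
// assumption about llama-server's error text.
func classify400(status int, body string) error {
	if status == 400 && strings.Contains(strings.ToLower(body), "context") {
		return fmt.Errorf("llama-server: %s: %w", body, ErrContextOverflow)
	}
	if status >= 400 {
		return fmt.Errorf("llama-server: status %d: %s", status, body)
	}
	return nil
}

// generateWithFallback calls the requested model and retries once on the
// fallback model if the error chain contains ErrContextOverflow. It
// returns the output, the model that produced the final result, and the
// final error.
func generateWithFallback(model string, call func(string) (string, error)) (string, string, error) {
	out, err := call(model)
	if errors.Is(err, ErrContextOverflow) && model != fallbackModel {
		out, err = call(fallbackModel)
		return out, fallbackModel, err
	}
	return out, model, err
}

func main() {
	// Simulated backend: the local model overflows, the fallback succeeds.
	call := func(model string) (string, error) {
		if strings.HasPrefix(model, "llama/") {
			return "", classify400(400, "request exceeds the available context size")
		}
		return "ok from " + model, nil
	}
	out, used, err := generateWithFallback("llama/qwen3:30b", call)
	fmt.Println(out, used, err)
}
```

Wrapping the sentinel with %w lets both the ReAct and Single-Shot paths test for it with errors.Is without parsing error strings.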

OpenRouter Special Features

The OpenRouter client has several features not present in other providers:

  • Request Jitter: 100-400ms random delay to prevent thundering-herd 429 errors
  • Free Model Detection: isFreeModel() identifies free-tier models and logs warnings when used with tools
  • Tool-Aware Fallback: GetFreeModelsFallbackChainToolAware() filters out "flash" models when tools are present (flash models fail tool calls)
  • 400 Error Retry: If a 400 error mentions tool_choice, retries without tool_choice
  • 429 Handling: Reads Retry-After header, progressive backoff (10s, 30s, 60s + jitter)
  • Cache Metrics: Tracks prompt_tokens_details.cache_hit and cache_miss from OpenRouter responses
  • Cost Tracking: Extracts usage.cost from response JSON
  • Prompt Caching: OpenRouter prompt caching enabled (May 9, 2026) — reduces token costs for long conversations
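
The jitter and 429 handling described in the list above can be approximated as follows. This is a sketch under stated assumptions: the function names are hypothetical, Retry-After is assumed to take precedence over the progressive schedule, only the whole-seconds form of Retry-After is parsed, and jitter is assumed to be added in both cases. A caller would pass resp.Header.Get("Retry-After") as the first argument.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
	"time"
)

// backoffSteps are the progressive delays listed above.
var backoffSteps = []time.Duration{10 * time.Second, 30 * time.Second, 60 * time.Second}

// requestJitter returns a random pre-request delay between 100ms and 400ms.
func requestJitter() time.Duration {
	extra := rand.Int63n(int64(300*time.Millisecond) + 1)
	return 100*time.Millisecond + time.Duration(extra)
}

// retryDelay picks the wait before retry number attempt (0-based).
// A Retry-After value in whole seconds wins; otherwise the progressive
// schedule applies, clamped to its last step. maxJitter adds a random
// extra delay in [0, maxJitter); pass 0 to disable it.
func retryDelay(retryAfter string, attempt int, maxJitter time.Duration) time.Duration {
	var jitter time.Duration
	if maxJitter > 0 {
		jitter = time.Duration(rand.Int63n(int64(maxJitter)))
	}
	if retryAfter != "" {
		if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
			return time.Duration(secs)*time.Second + jitter
		}
	}
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= len(backoffSteps) {
		attempt = len(backoffSteps) - 1
	}
	return backoffSteps[attempt] + jitter
}

func main() {
	fmt.Println("pre-request jitter:", requestJitter())
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println("attempt", attempt, "wait", retryDelay("", attempt, time.Second))
	}
	fmt.Println("server hint:", retryDelay("7", 0, 0))
}
```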

Free Models System

File: assistant/free-models.go (391 lines)

The FreeModelsProvider manages a JSON config (free-models.json) that controls model selection for different scenarios:

| Function | Purpose |
|---|---|
| GetInterimFreeModel() | Best model for interim/streaming responses |
| GetGreetingFreeModel() | Best model for greeting messages |
| GetFreeModelsFallbackChain() | Priority-sorted fallback chain (local models first) |
| GetFreeModelsFallbackChainToolAware(hasTools) | Fallback chain excluding "flash" models when tools needed |

CRUD Operations: AddModel(), UpdateModel(), RemoveModel(), ReorderModels() — all thread-safe with sync.RWMutex.
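
The tool-aware chain and the RWMutex guard can be sketched as follows. The freeModels type is a simplified, hypothetical stand-in for FreeModelsProvider. Detecting flash models by a case-insensitive "flash" substring in the identifier is an assumption, and the JSON persistence is left out.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// freeModels is a simplified, hypothetical stand-in for FreeModelsProvider.
type freeModels struct {
	mu     sync.RWMutex
	models []string // priority order, local models first
}

// AddModel appends a model under the write lock.
func (f *freeModels) AddModel(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.models = append(f.models, id)
}

// FallbackChainToolAware returns a copy of the chain under the read lock.
// When hasTools is true, models whose identifier contains "flash" are
// skipped, since flash models fail tool calls.
func (f *freeModels) FallbackChainToolAware(hasTools bool) []string {
	f.mu.RLock()
	defer f.mu.RUnlock()
	chain := make([]string, 0, len(f.models))
	for _, id := range f.models {
		if hasTools && strings.Contains(strings.ToLower(id), "flash") {
			continue
		}
		chain = append(chain, id)
	}
	return chain
}

func main() {
	f := &freeModels{}
	f.AddModel("llama/qwen3:8b")
	f.AddModel("google/gemini-2.5-flash")
	f.AddModel("openrouter/z-ai/glm-4.5-air:free")

	fmt.Println("without tools:", f.FallbackChainToolAware(false))
	fmt.Println("with tools:   ", f.FallbackChainToolAware(true))
}
```

Returning a copy rather than the internal slice keeps callers from mutating the chain outside the lock.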

Model Caching

GetAvailableModels() caches the model list for 3 hours (modelCacheTTL). On first call, it seeds synchronously from the static availableModels slice, then refreshes in the background by querying Ollama, llama-server, OpenRouter, Google, and Anthropic endpoints.
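
The seed-then-refresh behavior could be structured along these lines. Only the 3-hour modelCacheTTL and the synchronous seed followed by a background refresh come from this page. The modelCache type, its fields, and the Wait helper are hypothetical, and the fetch function stands in for the real endpoint queries.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const modelCacheTTL = 3 * time.Hour

// modelCache is a hypothetical sketch of the caching behavior.
type modelCache struct {
	mu         sync.Mutex
	wg         sync.WaitGroup
	models     []string
	fetchedAt  time.Time
	refreshing bool
	static     []string
	fetch      func() []string // stands in for the provider endpoint queries
}

func newModelCache(static []string, fetch func() []string) *modelCache {
	return &modelCache{static: static, fetch: fetch}
}

// Get returns the cached list. The first call seeds from the static list;
// a background refresh starts whenever the cache is older than the TTL.
func (c *modelCache) Get() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.models == nil {
		c.models = append([]string(nil), c.static...)
	}
	if !c.refreshing && time.Since(c.fetchedAt) >= modelCacheTTL {
		c.refreshing = true
		c.wg.Add(1)
		go c.refresh()
	}
	return append([]string(nil), c.models...)
}

// refresh queries the endpoints without holding the lock, then swaps in
// the new list. An empty result keeps the previous list.
func (c *modelCache) refresh() {
	defer c.wg.Done()
	fresh := c.fetch()
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(fresh) > 0 {
		c.models = fresh
	}
	c.fetchedAt = time.Now()
	c.refreshing = false
}

// Wait blocks until any in-flight refresh finishes (useful in demos/tests).
func (c *modelCache) Wait() { c.wg.Wait() }

func main() {
	cache := newModelCache(
		[]string{"llama/llama3.2:latest"},
		func() []string { return []string{"llama/qwen3:8b", "llama/qwen3:30b"} },
	)
	fmt.Println("first call:", cache.Get()) // served from the static seed
	cache.Wait()
	fmt.Println("after refresh:", cache.Get())
}
```

The first caller never blocks on the network, at the cost of briefly serving the static list.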


See also: Search Tools, Assistant Framework