Assistant Multi-LLM Provider Abstraction
The assistant module provides a resilient, provider-agnostic LLM interface that lets the system switch between model backends.
The Multi-LLM Architecture
The MultiLLM system (in assistant/multillm.go) acts as a unified registry and dispatcher for all LLM interactions.
Supported Providers
- Ollama: Local execution for privacy and speed (Primary: llama3.2).
- Anthropic: High-quality reasoning (Claude 3.5 Sonnet).
- Gemini: Large context and multimodal capabilities.
- DeepSeek / OpenRouter / Grok: Specialized and cost-effective alternatives.
- Free Providers: Automatic fallback to free tiers (Minimax, Gemini Lite) when primary models fail.
Model Discovery & Fallback
The system has moved away from hardcoded model assignments in Go code to a dynamic discovery model.
1. Markdown Headers
Agents define their preferred models directly in their prompt .md files using Model: and FallBackModel: headers.
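As an illustration of header-based discovery, the sketch below reads the two headers out of a prompt's text. It assumes each header sits on its own line in the form `Key: value`; the actual layout of the agent .md files is not shown here and may differ. `ModelPrefs` and `ParseModelHeaders` are hypothetical names.

```go
package main

import "strings"

// ModelPrefs holds the model names read from an agent prompt file.
type ModelPrefs struct {
	Model         string
	FallBackModel string
}

// ParseModelHeaders scans prompt text for "Model:" and "FallBackModel:" lines.
// Assumption: each header is on its own line as "Key: value". The first
// occurrence of each header wins; later ones are ignored.
func ParseModelHeaders(prompt string) ModelPrefs {
	var prefs ModelPrefs
	for _, line := range strings.Split(prompt, "\n") {
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a "Key: value" line
		}
		value = strings.TrimSpace(value)
		switch strings.TrimSpace(key) {
		case "Model":
			if prefs.Model == "" {
				prefs.Model = value
			}
		case "FallBackModel":
			if prefs.FallBackModel == "" {
				prefs.FallBackModel = value
			}
		}
	}
	return prefs
}
```

A prompt with no matching headers yields an empty `ModelPrefs`, which leaves the caller free to decide on a default.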
2. Fallback Chain
If the primary model fails (due to a timeout, rate limit, or other error), the system tries the following in order:
- FallBackModel: Specified in the agent's markdown.
- Global Free Models: Minimax → Gemini Lite → OpenRouter Free → Z-AI Flash.
Reliability Features
- Exponential Backoff: Retries failed requests with increasing delays (5s → 15s → 30s → 60s).
- Staggered Tool Calls: 1000ms delay between parallel tool invocations to prevent rate limiting.
- Token & Cost Tracking: Integrated observability (in assistant/observability.go) tracks token usage and costs per provider.
Component Diagram: LLM Dispatcher
graph LR
P[Planner] --> M[MultiLLM Dispatcher]
M --> O[Ollama Client]
M --> A[Anthropic Client]
M --> G[Gemini Client]
M --> OR[OpenRouter Client]
subgraph "Fallback Engine"
M --> F{Model Fails?}
F -- Yes --> FB[Free Tiers Cache]
FB --> Z[Z-AI / Minimax / etc.]
end
Key Files & Functions
- assistant/multillm.go: The central registry and GetLLMClient function.
- assistant/llm_utils.go: Common utilities for message conversion and sanitization.
- assistant/ollama_client.go: Implementation of the local Ollama client.
Guidance for AI Agents
- Model Selection: If you are working on a high-stakes task, request a higher-tier model (e.g., Claude) in your thoughts.
- Token Efficiency: Be mindful of context window limits; summarize long outputs before re-injecting them into the history.