Assistant Multi-LLM Provider Abstraction

The assistant module exposes a provider-agnostic LLM interface that lets the system switch between model backends and fall back to another backend when one fails.

The Multi-LLM Architecture

The MultiLLM system (in assistant/multillm.go) acts as a unified registry and dispatcher for all LLM interactions.
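As a rough illustration of what a registry and dispatcher can look like, the sketch below maps provider names to clients behind one interface. Everything in it is an assumption made for the example: the `LLMClient` interface, the `Registry` type, the `provider/model` naming convention, and the `echoClient` stub are hypothetical and are not taken from assistant/multillm.go, whose actual `GetLLMClient` signature is not shown here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// LLMClient is a hypothetical minimal interface that each provider client
// would satisfy. The real interface may differ.
type LLMClient interface {
	Complete(prompt string) (string, error)
}

// Registry maps lower-cased provider names to clients.
type Registry struct {
	mu      sync.RWMutex
	clients map[string]LLMClient
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{clients: map[string]LLMClient{}}
}

// Register adds or replaces the client for a provider.
func (r *Registry) Register(provider string, c LLMClient) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.clients[strings.ToLower(provider)] = c
}

// Lookup resolves a "provider/model" spec to the provider's client and the
// bare model name. The "provider/model" form is an assumption of this sketch.
func (r *Registry) Lookup(spec string) (LLMClient, string, error) {
	provider, model, ok := strings.Cut(spec, "/")
	if !ok {
		return nil, "", fmt.Errorf("model spec %q is not of the form provider/model", spec)
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, found := r.clients[strings.ToLower(provider)]
	if !found {
		return nil, "", fmt.Errorf("no client registered for provider %q", provider)
	}
	return c, model, nil
}

// echoClient is a stub provider that only echoes the prompt.
type echoClient struct{ name string }

func (e echoClient) Complete(prompt string) (string, error) {
	return e.name + ": " + prompt, nil
}
```

With this shape, callers depend only on `LLMClient`, so adding a provider means registering one more client, for example `r.Register("ollama", echoClient{name: "ollama"})` followed by `r.Lookup("ollama/llama3.2")`.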

Supported Providers

Ollama: Local execution for privacy and speed (Primary: llama3.2).
Anthropic: High-quality reasoning (Claude 3.5 Sonnet).
Gemini: Large context and multimodal capabilities.
DeepSeek / OpenRouter / Grok: Specialized and cost-effective alternatives.
Free Providers: Automatic fallback to free tiers (Minimax, Gemini Lite) when primary models fail.

Model Discovery & Fallback

The system has moved away from hardcoded model assignments in Go code to a dynamic discovery model.

1. Markdown Headers

Agents define their preferred models directly in their prompt .md files using Model: and FallBackModel: headers.
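A minimal sketch of how such headers could be read is shown below. It assumes the headers appear one per line in the first block of non-blank lines of the prompt file, which the text above does not confirm; `AgentModels` and `ParseModelHeaders` are hypothetical names, not functions from the codebase.

```go
package main

import (
	"bufio"
	"strings"
)

// AgentModels holds the model preferences read from an agent prompt file.
// The type and its field names are hypothetical.
type AgentModels struct {
	Model         string
	FallBackModel string
}

// ParseModelHeaders reads "Model:" and "FallBackModel:" headers from a prompt.
// Assumption: the headers live in the first block of non-blank lines, and the
// first blank line after that block ends the header section.
func ParseModelHeaders(prompt string) AgentModels {
	var m AgentModels
	started := false
	sc := bufio.NewScanner(strings.NewReader(prompt))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			if started {
				break // end of the header block
			}
			continue // skip leading blank lines
		}
		started = true
		// Split on the first colon only, so values such as
		// "llama3.2:latest" keep their own colon.
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		switch strings.TrimSpace(key) {
		case "Model":
			m.Model = strings.TrimSpace(value)
		case "FallBackModel":
			m.FallBackModel = strings.TrimSpace(value)
		}
	}
	return m
}
```

Stopping at the first blank line keeps a stray "Model:" in the body of the prompt from overriding the header value.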

2. Fallback Chain

If the primary model fails (timeout, rate limit, or other error), the system tries the following in order:

FallBackModel: Specified in the agent's markdown.
Global Free Models: Minimax → Gemini Lite → OpenRouter Free → Z-AI Flash.
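The order described above can be sketched as a simple loop over candidate models. This is an illustration under assumptions, not the project's implementation: `CallFunc`, `BuildChain`, and `CallWithFallback` are hypothetical names, and the strings in `globalFreeModels` are placeholders for the free tiers listed above rather than real model identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// CallFunc sends a prompt to one named model. It is a hypothetical stand-in
// for whatever the real provider clients expose.
type CallFunc func(model, prompt string) (string, error)

// globalFreeModels mirrors the documented order of the free tiers.
// The identifiers are placeholders, not real model IDs.
var globalFreeModels = []string{"minimax", "gemini-lite", "openrouter-free", "z-ai-flash"}

// BuildChain returns the models to try, in order: the agent's primary model,
// its FallBackModel, then the global free models. Empty and duplicate
// entries are skipped.
func BuildChain(primary, fallback string) []string {
	candidates := append([]string{primary, fallback}, globalFreeModels...)
	chain := make([]string, 0, len(candidates))
	seen := map[string]bool{}
	for _, m := range candidates {
		if m == "" || seen[m] {
			continue
		}
		seen[m] = true
		chain = append(chain, m)
	}
	return chain
}

// CallWithFallback tries each model in turn and returns the first successful
// reply together with the model that produced it.
func CallWithFallback(chain []string, prompt string, call CallFunc) (string, string, error) {
	if len(chain) == 0 {
		return "", "", errors.New("empty fallback chain")
	}
	var lastErr error
	for _, model := range chain {
		out, err := call(model, prompt)
		if err == nil {
			return out, model, nil
		}
		lastErr = fmt.Errorf("%s: %w", model, err)
	}
	return "", "", fmt.Errorf("all %d models failed, last error: %w", len(chain), lastErr)
}
```

Returning the name of the model that answered makes it possible to record which tier actually served a request.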

Reliability Features

Exponential Backoff: Retries failed requests with increasing delays (5s → 15s → 30s → 60s).
Staggered Tool Calls: 1000ms delay between parallel tool invocations to prevent rate limiting.
Token & Cost Tracking: Integrated observability (in assistant/observability.go) tracks token usage and costs per provider.

Component Diagram: LLM Dispatcher

graph LR
    P[Planner] --> M[MultiLLM Dispatcher]
    M --> O[Ollama Client]
    M --> A[Anthropic Client]
    M --> G[Gemini Client]
    M --> OR[OpenRouter Client]
    
    subgraph "Fallback Engine"
        M --> F{Model Fails?}
        F -- Yes --> FB[Free Tiers Cache]
        FB --> Z[Z-AI / Minimax / etc.]
    end

Key Files & Functions

assistant/multillm.go: The central registry and the GetLLMClient function.
assistant/llm_utils.go: Common utilities for message conversion and sanitization.
assistant/ollama_client.go: Implementation of the local Ollama client.

Guidance for AI Agents

Model Selection: If you are working on a high-stakes task, request a higher-tier model (e.g., Claude) in your thoughts.
Token Efficiency: Be mindful of context window limits; summarize long outputs before re-injecting them into the history.
