Last Tuesday I noticed my Ollama Cloud Pro quota draining faster than usual. Way faster. I had burned through 603 million tokens in seven days without understanding where they went.
I opened my Hermes Agent logs and found something I did not know existed: an auxiliary: block with twelve background tasks. Compression, web extraction, vision, session search, skills matching — all running silently every time I typed a message. Every task was set to provider: auto. And because I had no API keys for the fallback chain, every one silently fell back to kimi-k2.6, my one-trillion-parameter main model.
I had no idea this was happening. The agent was sending eleven background prompts to the same model I was actively chatting with, through the same quota, without showing me the prompts. Compression alone fired 10–20 times per long session, each pass sending the full conversation history.
This is what I fixed immediately.
The Fix
In the `auxiliary` block of my ~/.hermes/config.yaml, I gave every task an explicit provider and model instead of `provider: auto`. The complete YAML is in the Full Config section below.
Apply with /reset or restart Hermes. Config changes only take effect on new sessions.
If you just want the config and don’t care about the story, jump to Full Config below, copy it, /reset, done.
How the Routing Works
Twelve tasks used to collapse into one trillion-parameter model. Now they are distributed across six models, from 8B to 1T.
What provider: auto Actually Does
I searched dozens of Hermes guides online. Not a single one mentions the auxiliary block. The official docs describe the YAML structure, but there’s no warning that provider: auto silently falls back to your main model. I only found one video by AI Garage discussing this — nothing else. No blog posts, no Discord threads, no Reddit discussions.
The auto chain is: OpenRouter → Nous Portal → Codex → Gemini Flash. If none of these backends has an API key configured, the task falls back to your main chat model.
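To make the fallback concrete, here is a minimal sketch of how a chain like this could resolve. It is not Hermes' actual code: the function name, the provider identifiers, and the environment-variable names are all hypothetical, chosen only to illustrate the "first backend with a key wins, otherwise the main model" behavior described above.

```python
import os
from typing import Mapping, Optional, Tuple

# Order of the auto chain as described above. Provider ids and
# environment-variable names are illustrative assumptions, not
# Hermes' real configuration keys.
AUTO_CHAIN = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("portal", "PORTAL_API_KEY"),
    ("codex", "CODEX_API_KEY"),
    ("gemini-flash", "GEMINI_API_KEY"),
]


def resolve_auto(
    main_model: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, Optional[str]]:
    """Return (provider, model) for a task left on provider: auto."""
    env = os.environ if env is None else env
    for provider, key_name in AUTO_CHAIN:
        if env.get(key_name):
            # First backend with a key wins; None stands for
            # "whatever that provider's default model is".
            return provider, None
    # No backend has a key: the task lands on the main chat model.
    return "main", main_model


# With no keys configured, every auxiliary task resolves to the main model.
print(resolve_auto("kimi-k2.6", env={}))
```

With an empty environment the call returns the main model, which is exactly the situation I was in: twelve tasks, one model, one quota.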
So while I was typing one message to k2.6, the agent was sending eleven others to the same model through the same quota, without showing me the prompt. Compression alone was firing 10–20 times per long session, sending the full conversation history every time.
My Ollama Cloud Pro Catalog
I have an Ollama Cloud Pro subscription. Here are the models available in my catalog that matter for routing:
| Model | Size | Strength | Best for |
|---|---|---|---|
| `kimi-k2.6` | ~1T params, 256K context | Reasoning, architecture, debugging | Main chat only |
| `kimi-k2.5` | ~1T params | Same family, optimized for long context | Summarization, compression |
| `qwen3-vl:235b-instruct` | 235B params | Multimodal (vision + text) | Screenshots, image analysis |
| `deepseek-v4-flash` | ~20B params | Fast, good at structured output | Safety checks, classification |
| `gemma3:12b` | 12B params | Lightweight, fast | Triage, profile tasks |
| `rnj-1:8b` | 8B params | Cheapest in catalog | Titles, search, skills matching |
| `gemma4:e2b` | 2B params | Smallest | Not used: too weak for any auxiliary task |
I pulled all of these and tested them against the twelve auxiliary tasks. The routing below is the result of that testing.
Twelve Background Tasks
My Hermes version has twelve auxiliary tasks. I ordered them by how much they cost in practice:
| # | Task | What it does | Why it costs money |
|---|---|---|---|
| 1 | `compression` | Summarizes conversation when context exceeds limits | Fires 10–20 times per long session. Each pass sends the full conversation history. |
| 2 | `web_extract` | Strips HTML boilerplate after `web_search` | Fires every time you search the web. |
| 3 | `vision` | Processes screenshots and images | Multimodal tokens are expensive. |
| 4 | `flush_memories` | Writes facts to memory files on `/new` or `/exit` | Runs at every session end. |
| 5 | `kanban_decomposer` | Breaks Kanban tasks into steps | Medium complexity, runs on board operations. |
| 6 | `curator` | Analyzes skill quality and redundancy | Heavy analysis task. |
| 7 | `session_search` | Searches past sessions and summarizes matches | Runs when you look up old conversations. |
| 8 | `skills_hub` | Matches your query to installed skills | Runs on most questions. |
| 9 | `triage_specifier` | Classifies incoming messages | Binary classification. |
| 10 | `approval` | Binary safety check before terminal commands | Simple yes/no on safety. |
| 11 | `profile_describer` | Generates profile bio | Rare, lightweight. |
| 12 | `title_generation` | Auto-names new sessions | Trivial, runs constantly. |
Note: older Hermes versions (like the one shown in the AI Garage video) show eight tasks. My version has twelve; four were added in recent updates.
How I Mapped Models to Tasks
The logic is dead simple: put the lightest model that doesn’t break on each task, and keep k2.6 for actual conversations.
| Task | Model | Why this one |
|---|---|---|
| Main chat | `kimi-k2.6` | Architecture, debugging, discussion. The only task that actually needs a trillion parameters. |
| Compression, web_extract, kanban, curator | `kimi-k2.5` | Same Kimi family, optimized for long context. Summarization quality stays high. Using k2.6 for compression was burning quota for no quality gain. |
| Vision | `qwen3-vl:235b-instruct` | Only multimodal model in the catalog at this level. No alternative exists for image analysis. |
| Triage, profile | `gemma3:12b` | 12B params vs 1T. Classification and bio generation do not need reasoning depth. |
| Approval | `deepseek-v4-flash` (~20B) | Binary safety check. Fast response time matters more than reasoning quality. |
| Titles, search, skills, MCP | `rnj-1:8b` | 8B params. 125 times lighter than k2.6. The bulk of the savings is here: these tasks run constantly but need minimal intelligence. |
What I Tried First: Local Models
Before settling on cloud routing, I tried running auxiliary tasks locally. I already had gemma4:e2b pulled via Ollama on my machine.
RTX 5070 Ti, 8 GB VRAM. One 6B-parameter model fits. Two is already borderline. Every time Hermes switched from compression to approval, Ollama unloaded one model and loaded another. Five to ten seconds of dead air. The GPU fan kicked in. I lost more time waiting for model swaps than I saved on tokens. I abandoned local auxiliary models the same day.
The Numbers
Here is what the routing actually means in terms of model size:
| Task | Before (default) | After (routed) | Reduction |
|---|---|---|---|
| Titles, search, skills, MCP | `kimi-k2.6` (~1T) | `rnj-1:8b` (8B) | 125x lighter |
| Triage, profile | `kimi-k2.6` (~1T) | `gemma3:12b` (12B) | 83x lighter |
| Approval | `kimi-k2.6` (~1T) | `deepseek-v4-flash` (~20B) | 50x lighter |
| Compression, web_extract | `kimi-k2.6` (~1T) | `kimi-k2.5` (~1T) | Same family, frees k2.6 for chat |
| Vision | `kimi-k2.6` (~1T) | `qwen3-vl` (235B) | Dedicated multimodal model |
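The "x lighter" figures in the Reduction column are plain parameter-count ratios. A small sketch that reproduces them, using the approximate sizes from the table (parameter count is only a rough proxy for cost, not a measured price):

```python
# Approximate parameter counts in billions, taken from the table above.
PARAMS_B = {
    "kimi-k2.6": 1000,
    "rnj-1:8b": 8,
    "gemma3:12b": 12,
    "deepseek-v4-flash": 20,
}


def times_lighter(before: str, after: str) -> int:
    """Ratio of parameter counts, rounded to the nearest whole number."""
    return round(PARAMS_B[before] / PARAMS_B[after])


for model in ("rnj-1:8b", "gemma3:12b", "deepseek-v4-flash"):
    print(f"{model}: {times_lighter('kimi-k2.6', model)}x lighter")
```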
The video by AI Garage measured compression cost directly: Claude Opus at 50K context = 13 cents per pass. Kimi K2 for the same task = 1.9 cents. That is an 85% reduction for a single compression pass. Compression fires 10–20 times per day for heavy users. The author estimated that with default settings, compression alone can cost $60/month with Claude Opus. Routed to a cheaper model, it drops to $9/month.
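The arithmetic behind those figures can be checked in a few lines. The per-pass prices come from the video; the 15 passes per day (the midpoint of 10–20) and the 30-day month are my own assumptions, picked to see whether they land near the quoted monthly totals.

```python
OPUS_CENTS_PER_PASS = 13.0  # Claude Opus at 50K context, per the video
KIMI_CENTS_PER_PASS = 1.9   # Kimi K2 on the same task, per the video


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when moving from `before` to `after`."""
    return (before - after) / before * 100


def monthly_dollars(cents_per_pass: float, passes_per_day: int, days: int = 30) -> float:
    """Monthly compression cost in dollars for a given per-pass price."""
    return cents_per_pass * passes_per_day * days / 100


print(f"reduction: {reduction_pct(OPUS_CENTS_PER_PASS, KIMI_CENTS_PER_PASS):.1f}%")
print(f"Opus: ${monthly_dollars(OPUS_CENTS_PER_PASS, 15):.2f}/month")
print(f"Kimi: ${monthly_dollars(KIMI_CENTS_PER_PASS, 15):.2f}/month")
```

Under those assumptions the totals come out at roughly $58.50 and $8.55 per month, in line with the $60 and $9 estimates above.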
I cannot confirm exact dollar savings for Ollama Cloud — they do not expose per-call pricing. But the scale difference is unambiguous.
Current Status
| Component | Status | Notes |
|---|---|---|
| Heavy auxiliary on k2.5 | Working | Compression and web_extract no longer block the main model |
| Vision on qwen3-vl | Working | Only multimodal option available |
| Medium tasks on gemma3:12b | Working | Triage and profile classification |
| Approval on deepseek-v4-flash | Working | Fast binary decisions |
| Light tasks on rnj-1:8b | Working | Titles, search, skills, MCP |
| `provider: auto` removed | Done | Explicit provider on every task |
| Local Ollama auxiliary | Abandoned | VRAM contention on 8 GB laptop |
| Cost tracking per task | Not possible | Ollama Cloud does not expose per-call pricing |
Sessions no longer pause on "Compressing messages". Auxiliary tasks stopped monopolizing k2.6's token quota, and quota exhaustion mid-session is gone.
Full Config
Here is the complete auxiliary: block from my ~/.hermes/config.yaml:
```yaml
auxiliary:
  compression:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 120
  web_extract:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 360
  kanban_decomposer:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 180
  curator:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 600
  vision:
    provider: ollama-cloud
    model: qwen3-vl:235b-instruct
    timeout: 120
    download_timeout: 30
  triage_specifier:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 120
  profile_describer:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 60
  approval:
    provider: ollama-cloud
    model: deepseek-v4-flash
    timeout: 30
  title_generation:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  session_search:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
    max_concurrency: 3
  skills_hub:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  mcp:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
```
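If you want to double-check that no task is still on auto after editing, a small helper like the one below can do it. This is a sketch, not part of Hermes: `unrouted_tasks` is a hypothetical name, it expects the `auxiliary` block already parsed into a dict (with whatever YAML library you use), and it assumes that a task with no `provider` key behaves like `provider: auto`.

```python
def unrouted_tasks(auxiliary: dict) -> list:
    """Return the names of tasks that lack an explicit provider or model."""
    problems = []
    for task, settings in auxiliary.items():
        settings = settings or {}
        # Assumption: a missing provider is treated the same as "auto".
        provider = settings.get("provider", "auto")
        if provider == "auto" or not settings.get("model"):
            problems.append(task)
    return sorted(problems)


# Example: two tasks routed explicitly, two left on the default.
example = {
    "compression": {"provider": "ollama-cloud", "model": "kimi-k2.5", "timeout": 120},
    "approval": {"provider": "ollama-cloud", "model": "deepseek-v4-flash", "timeout": 30},
    "vision": {"provider": "auto"},
    "title_generation": None,
}
print(unrouted_tasks(example))
```

An empty list means every task has both an explicit provider and a model.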
Try It
If you run Hermes Agent and have never touched the auxiliary block:
```shell
hermes config edit
```
Find the `auxiliary:` block. Set an explicit provider and model for every task: the smallest model that handles the task without dragging extra parameters along. Save. Run /reset.
Then tomorrow run hermes insights --days 1. Your main model should stop eating the entire token budget.
If you use Claude or another frontier model as your main provider, the default config is even more expensive — every background task inherits that model. Route them to smaller models. Or go local if your hardware handles it.
What auxiliary routing are you using? Drop it in the comments.
LinkedIn — https://www.linkedin.com/in/azamat-safarov-4a37b93a7/
X / Twitter — https://x.com/Azamat__Safarov

