Test Suite

Four capability tests cover the full agent stack:

| Test | What it measures |
| --- | --- |
| test_ollama_tool_use.py | Correct tool invocation via OpenAI-compatible API |
| test_agent_capability.py | 9 realistic scenarios: routing, vault search, error recovery, hallucination |
| test_zeno_quality.py | Data agent (zeno) quality: tool calls + answer correctness, scored 0–3 per query |
| test_pattern_agent.py | Pattern agent (rumi): extract, consolidate, distill, promote, validate |

Run against any model:

# Local
python tools/test_agent_capability.py --models gemma4:26b

# Ollama cloud
python tools/test_agent_capability.py \
  --url https://ollama.com \
  --api-key <your-key> \
  --models gemma4:31b
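
To repeat the same run against several endpoints, the two invocations above can be wrapped in a small helper. This is a minimal sketch that assumes only the flags shown above (`--models`, `--url`, `--api-key`); the `build_command` and `run` helpers are hypothetical and not part of the repository, and whether the other three test scripts accept the same flags is not confirmed here.

```python
import subprocess
import sys


def build_command(script, model, url=None, api_key=None):
    """Build the argv list for one test run (hypothetical helper)."""
    cmd = [sys.executable, script, "--models", model]
    if url:
        cmd += ["--url", url]
    if api_key:
        cmd += ["--api-key", api_key]
    return cmd


def run(script, model, url=None, api_key=None):
    """Run one test script and return its exit code."""
    return subprocess.run(build_command(script, model, url, api_key)).returncode


SCRIPT = "tools/test_agent_capability.py"

# Local run: no URL or key needed.
local_cmd = build_command(SCRIPT, "gemma4:26b")

# Ollama cloud run: "KEY" is a placeholder, substitute your own key.
cloud_cmd = build_command(SCRIPT, "gemma4:31b",
                          url="https://ollama.com", api_key="KEY")

print(" ".join(local_cmd[1:]))
print(" ".join(cloud_cmd[1:]))
```

Building the argument list separately from running it keeps the key out of shell history and makes the command easy to inspect before it is executed.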

Results

gemma4:31b (Ollama cloud) — 2026-04-15

Tool use

| Model | Result | Time |
| --- | --- | --- |
| gemma4:31b | OK | 6.1s |

Agent capability (9 scenarios)

| Scenario | Result | Notes |
| --- | --- | --- |
| relay_route_agenda | OK | |
| relay_route_data | OK | |
| relay_no_tool_needed | OK | |
| data_search_by_tag | OK | |
| data_read_file | OK | |
| data_error_recovery | ~ | Missing: hiro, mcp |
| data_hallucination_check | fail | Did not return "not found" |
| agenda_read_today | OK | |
| agenda_add_event | ~ | Event not written to correct date |

6/9 OK, 2 partial, 1 fail
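
The summary line is a plain count of the per-scenario outcomes. A minimal sketch of that tally, using the nine gemma4:31b outcomes reported above; the `summarize` helper and the `"ok"` / `"partial"` / `"fail"` labels are illustrative and may not match what the test script uses internally:

```python
from collections import Counter

# Outcomes for the nine gemma4:31b agent-capability scenarios.
results = {
    "relay_route_agenda": "ok",
    "relay_route_data": "ok",
    "relay_no_tool_needed": "ok",
    "data_search_by_tag": "ok",
    "data_read_file": "ok",
    "data_error_recovery": "partial",
    "data_hallucination_check": "fail",
    "agenda_read_today": "ok",
    "agenda_add_event": "partial",
}


def summarize(results):
    """Format outcome counts in the same shape as the summary line."""
    counts = Counter(results.values())
    total = len(results)
    return (f"{counts['ok']}/{total} OK, "
            f"{counts['partial']} partial, {counts['fail']} fail")


print(summarize(results))  # 6/9 OK, 2 partial, 1 fail
```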

Zeno quality (5 queries, max 15 points)

| Query | Score |
| --- | --- |
| agent-count | 3/3 |
| billing-location | 3/3 |
| billing-error-raises | 3/3 |
| relay-imports | 3/3 |
| agent-list | 3/3 |

Total: 15/15

Pattern agent (10 scenarios)

| Scenario | Result | Notes |
| --- | --- | --- |
| extract_basic | OK | |
| extract_empty | OK | |
| consolidate_duplicates | OK | |
| consolidate_contradiction | ~ | Contradiction unresolved (acceptable) |
| consolidate_clean | OK | |
| distill_ready | OK | |
| distill_not_ready | OK | |
| distill_quality | fail | No output after 336s (timeout) |
| promote_rules | OK | |
| validate_meta | ~ | References undisclosed context |

7/10 OK, 2 partial, 1 fail

Assessment: gemma4:31b cloud is strong for relay and data work (zeno: 15/15), but weak on hallucination rejection and on long-context distillation, where distill_quality timed out after 336s.

glm-5.1 (Ollama cloud) — 2026-04-15

Tool use

| Model | Result | Notes |
| --- | --- | --- |
| glm-5.1 | ~ | Loops on list_files (max iterations reached, 68.6s) |
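
A "max iterations reached" result implies the test harness caps how many tool calls a model may make before giving up. The sketch below shows one way such a cap could work. It is not the harness's actual code: `run_agent`, the `step` callable, and the cap of 5 are all assumptions made for illustration.

```python
def run_agent(step, max_iterations=8):
    """Drive a tool-calling loop until the model answers or the cap is hit.

    `step(history)` is a hypothetical callable that returns either
    ("tool", name, args) or ("answer", text).
    """
    history = []
    for _ in range(max_iterations):
        action = step(history)
        if action[0] == "answer":
            return {"status": "ok", "answer": action[1], "calls": history}
        # Record the tool call and let the model try again.
        history.append((action[1], action[2]))
    return {"status": "max_iterations", "answer": None, "calls": history}


def looping_model(history):
    """Stand-in model that keeps requesting list_files and never answers."""
    return ("tool", "list_files", {"path": "."})


outcome = run_agent(looping_model, max_iterations=5)
print(outcome["status"], len(outcome["calls"]))  # max_iterations 5
```

Without the cap, a model that repeats the same tool call would never terminate, so the cap turns a hang into a reportable partial result.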

Agent capability (9 scenarios)

| Scenario | Result | Notes |
| --- | --- | --- |
| relay_route_agenda | OK | |
| relay_route_data | OK | |
| relay_no_tool_needed | OK | |
| data_search_by_tag | OK | |
| data_read_file | OK | |
| data_error_recovery | OK | Multi-step recovery worked (84s) |
| data_hallucination_check | fail | Wrong tool (scan_tags), no "not found" |
| agenda_read_today | OK | |
| agenda_add_event | ~ | Event not written correctly |

7/9 OK, 1 partial, 1 fail
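
The notes on data_hallucination_check suggest the scenario looks for an explicit "not found" in the answer and, possibly, for which tool was called. The following is a rough sketch of such a check, not the test's real logic: `check_hallucination` and its `expected_tool` parameter are hypothetical, and the sample answers are made up for illustration.

```python
def check_hallucination(answer, tools_called, expected_tool=None):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if expected_tool is not None and expected_tool not in tools_called:
        problems.append(f"expected tool {expected_tool!r} not called")
    # The model should admit the item does not exist rather than invent it.
    if "not found" not in answer.lower():
        problems.append('no "not found" in answer')
    return problems


# An invented answer after calling scan_tags fails the check.
print(check_hallucination("The note describes billing rules.", ["scan_tags"]))

# An explicit refusal passes.
print(check_hallucination("Not found: no such note exists.", []))
```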

Zeno quality (5 queries, max 15 points)

| Query | Score | Notes |
| --- | --- | --- |
| agent-count | 1/3 | Tool called, count missing from answer |
| billing-location | 3/3 | |
| billing-error-raises | 3/3 | |
| relay-imports | 1/3 | Tool called, answer unhelpful |
| agent-list | 1/3 | Tool called, agents not extracted |

Total: 9/15
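
The total is the sum of the per-query scores, each on the 0 to 3 scale, against a maximum of 3 points per query. A small sketch of that arithmetic using the glm-5.1 scores above; the `total_score` helper is illustrative and makes no claim about how the test assigns the individual scores:

```python
MAX_PER_QUERY = 3

# Per-query scores for glm-5.1, as reported above.
scores = {
    "agent-count": 1,
    "billing-location": 3,
    "billing-error-raises": 3,
    "relay-imports": 1,
    "agent-list": 1,
}


def total_score(scores):
    """Return (points earned, maximum points) after validating each score."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_PER_QUERY:
            raise ValueError(f"{name}: score {value} outside 0..{MAX_PER_QUERY}")
    return sum(scores.values()), MAX_PER_QUERY * len(scores)


earned, maximum = total_score(scores)
print(f"Total: {earned}/{maximum}")  # Total: 9/15
```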

Pattern agent (10 scenarios)

| Scenario | Result | Notes |
| --- | --- | --- |
| extract_basic | OK | |
| extract_empty | ~ | 1 low-confidence entry |
| consolidate_duplicates | OK | |
| consolidate_contradiction | ~ | Contradiction unresolved (acceptable) |
| consolidate_clean | OK | |
| distill_ready | OK | |
| distill_not_ready | OK | |
| distill_quality | fail | No output (8s, refused) |
| promote_rules | OK | |
| validate_meta | OK | |

7/10 OK, 2 partial, 1 fail

Assessment: glm-5.1 cloud performs well on agent routing and error recovery, but is unreliable for deep code analysis (zeno: 9/15), and its tendency to loop on tool calls is a concern. It is not recommended as the primary model for data/zeno tasks.

Comparison

| | gemma4:31b | glm-5.1 |
| --- | --- | --- |
| Tool use | OK | ~ (loops) |
| Agent capability | 6/9 | 7/9 |
| Zeno quality | 15/15 | 9/15 |
| Pattern agent | 7/10 | 7/10 |
| Avg response time | ~5s | ~27s |
| Recommendation | relay, zeno, cato | relay, cato |

Ollama Cloud Setup

API keys are managed at ollama.com under Usage → API keys.

Config:

"ollama": {
  "cloud": {
    "api_key": "your-key",
    "base_url": "https://ollama.com/v1"
  }
}

Model aliases use "provider": "ollama.cloud". Model names match the Ollama library name without any suffix (e.g. gemma4:31b, not gemma4:31b-cloud).
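
One plausible reading of the dotted provider name is that it indexes into the nested config, so `"ollama.cloud"` selects the `cloud` block under `ollama`. The sketch below illustrates that reading only. The `resolve_provider` helper, the shape of the alias dictionary, and the enclosing braces around the config fragment are assumptions, not the project's documented behavior.

```python
import json

# The config fragment from above, wrapped in braces to form valid JSON.
CONFIG_TEXT = """
{
  "ollama": {
    "cloud": {
      "api_key": "your-key",
      "base_url": "https://ollama.com/v1"
    }
  }
}
"""


def resolve_provider(config, provider):
    """Walk a dotted provider name such as "ollama.cloud" through the config."""
    node = config
    for part in provider.split("."):
        node = node[part]
    return node


config = json.loads(CONFIG_TEXT)

# Hypothetical alias entry; only the "provider" value comes from the text above.
alias = {"provider": "ollama.cloud", "model": "gemma4:31b"}

settings = resolve_provider(config, alias["provider"])
print(settings["base_url"])  # https://ollama.com/v1
```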

Available cloud models: ollama.com/search

See Also

  • Configuration — Full config reference including provider setup