Model Benchmarks - SwarmMarshal

Function charts

Recommended baseline models by function and hardware bucket.

Promotion gates are function-specific. Message pipeline rows require a 90% pass rate; AI chat assistant rows use a lower 60% pass-rate gate, so that strictly scored near misses do not hide good local chat candidates.

Function	Hardware bucket	Model	Score	Pass	Latency	Confidence	Required settings
Message pipeline	generic	`Claude/claude-sonnet-4-6`	1.000	100%	23.5s	official-lab
Message pipeline	nvidia-4070-class-64gb	`Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL`	0.999	100%	24.3s	official-lab	keep_alive=30m think=False
Message pipeline	subscription-cli	`ClaudeCode/claude-code:opus`	1.000	100%	21.6s	official-lab	provider_kind=subscription-cli requires_subscription=True
AI chat assistant	generic	`OpenAI/gpt-5.5`	0.940	67%	34.1s	official-lab
AI chat assistant	nvidia-4070-class-64gb	`Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL`	0.940	67%	50.6s	official-lab	keep_alive=30m think=False
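
The "Required settings" cells above are space-separated key=value pairs. A minimal sketch of how a client might turn such a cell into a settings map, assuming that format holds for every row; `parse_settings` is a hypothetical helper, not part of the app:

```python
def parse_settings(cell: str) -> dict[str, object]:
    """Parse a 'Required settings' cell such as 'keep_alive=30m think=False'."""
    settings: dict[str, object] = {}
    for token in cell.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip tokens that are not key=value pairs
        if value in ("True", "False"):
            # Assumption: booleans are spelled exactly as in the chart.
            settings[key] = value == "True"
        else:
            settings[key] = value  # keep durations like '30m' as plain strings
    return settings


local = parse_settings("keep_alive=30m think=False")
cli = parse_settings("provider_kind=subscription-cli requires_subscription=True")
print(local)
print(cli)
```

Rows with an empty settings cell would parse to an empty map under this sketch.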

Official rows

Full aggregate benchmark charts.

These are aggregate calibration results only. They never include raw messages, prompts, model responses, headers, extracted facts, embeddings, or personal data.

Message pipeline

Gate: score at least 0.88, pass rate at least 90%, no critical failures, and no parse failures.
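
The gate can be read as four conditions that must all hold. The sketch below is one way to express it, not the project's actual implementation; the `Gate` and `Result` types and their field names are hypothetical. The pass counts in the example are inferred from the published percentages on a 9-case run, and the failure counts default to zero because the chart does not publish them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    min_score: float
    min_pass_rate: float  # fraction of cases passed, 0.0 to 1.0


@dataclass(frozen=True)
class Result:
    score: float
    passed: int
    total: int
    critical_failures: int = 0  # assumption: not published in the chart
    parse_failures: int = 0     # assumption: not published in the chart


MESSAGE_PIPELINE_GATE = Gate(min_score=0.88, min_pass_rate=0.90)


def gate_status(result: Result, gate: Gate) -> str:
    """Return 'Clears' only when every gate condition holds, else 'Hold'."""
    pass_rate = result.passed / result.total if result.total else 0.0
    clears = (
        result.score >= gate.min_score
        and pass_rate >= gate.min_pass_rate
        and result.critical_failures == 0
        and result.parse_failures == 0
    )
    return "Clears" if clears else "Hold"


all_nine = Result(score=1.000, passed=9, total=9)
one_miss = Result(score=0.998, passed=8, total=9)
print(gate_status(all_nine, MESSAGE_PIPELINE_GATE))
print(gate_status(one_miss, MESSAGE_PIPELINE_GATE))
```

On a 9-case run, a single failed case gives 8/9, about 89%, which is below the 90% threshold. That is why a row can score above 0.99 and still be held.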

Scope	Hardware bucket	Model	Score	Pass	Latency	Gate	Notes
Cloud	generic	`Claude/claude-sonnet-4-6`	1.000	100%	23.5s	Clears	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Local	nvidia-4070-class-64gb	`Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL`	0.999	100%	24.3s	Clears	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	subscription-cli	`ClaudeCode/claude-code:opus`	1.000	100%	21.6s	Clears	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only.
Cloud	subscription-cli	`ClaudeCode/claude-code:sonnet`	0.999	100%	70.3s	Clears	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only.
Cloud	generic	`OpenAI/gpt-5.5`	0.998	89%	15.2s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`Claude/claude-opus-4-7`	0.996	89%	17.5s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`OpenAI/gpt-5.4-mini`	0.991	89%	5.9s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`Claude/claude-haiku-4-5-20251001`	0.991	78%	9.9s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`DeepSeek/deepseek-v4-flash`	0.987	67%	14.5s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`OpenAI/gpt-5.4-nano`	0.936	67%	10.6s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	generic	`DeepSeek/deepseek-v4-pro`	0.466	56%	63.6s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only.
Local	nvidia-4070-class-64gb	`Ollama/qwen3:30b-a3b`	0.987	78%	31.8s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only.
Local	nvidia-4070-class-64gb	`Ollama/qwen3.5:35b-a3b`	0.983	67%	32.4s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only.
Local	nvidia-4070-class-64gb	`Ollama/qwen2.5:14b`	0.968	44%	68.7s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only.
Cloud	subscription-cli	`ClaudeCode/claude-code:haiku`	0.992	78%	60.0s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only.
Cloud	subscription-cli	`CodexCli/codex-cli:default`	0.000	0%	2.1s	Hold	Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only.

AI chat assistant

Gate: score at least 0.88, pass rate at least 60%, no critical failures, and no parse failures.

Scope	Hardware bucket	Model	Score	Pass	Latency	Gate	Notes
Cloud	generic	`OpenAI/gpt-5.5`	0.940	67%	34.1s	Clears	Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss.
Local	nvidia-4070-class-64gb	`Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL`	0.940	67%	50.6s	Clears	Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss.

Community data

What users can safely submit.

Allowed

Hardware bucket, OS/runtime version, model tag, quantization, score, pass rate, failure counts, latency, and safe runtime settings.

Never collected

Raw emails, prompts, model responses, headers, extracted facts, contact names, or any user-derived message content.
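
One way a client could enforce the allowed/never-collected split is an allowlist applied before anything leaves the machine. This is a sketch only: the field names below are hypothetical stand-ins for the allowed categories and do not describe the real submission schema.

```python
# Hypothetical field names mirroring the "Allowed" list above.
ALLOWED_FIELDS = frozenset({
    "hardware_bucket", "os_version", "runtime_version", "model_tag",
    "quantization", "score", "pass_rate", "failure_counts",
    "latency_seconds", "runtime_settings",
})


def sanitize_submission(record: dict) -> dict:
    """Keep only allowlisted aggregate fields; drop everything else."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


raw = {
    "hardware_bucket": "nvidia-4070-class-64gb",
    "model_tag": "qwen3:30b-a3b",
    "score": 0.987,
    "pass_rate": 0.78,
    "prompt": "should never be sent",
    "contact_names": ["should never be sent"],
}
clean = sanitize_submission(raw)
print(sorted(clean))
```

An allowlist fails closed: a new field added to the record later is dropped by default, rather than leaked until someone remembers to block it.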

Bundle API

Daily app update endpoint.

The desktop app checks this endpoint daily and caches the bundle locally. It only uses the bundle to rank candidates; local calibration and cooldown rules still decide whether a model is applied.
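
The separation described above, where the bundle ranks and local rules decide, could look roughly like the following. All names are hypothetical, and the ranking key (score descending, then latency ascending) is an assumption for illustration; the page does not state how the app orders candidates.

```python
from datetime import datetime, timedelta, timezone

CHECK_INTERVAL = timedelta(days=1)  # "checks this endpoint daily"


def needs_refresh(last_checked, now):
    """True when the cached bundle was never fetched or is at least a day old."""
    return last_checked is None or now - last_checked >= CHECK_INTERVAL


def rank_candidates(bundle_rows):
    """Order bundle rows by score (high first), then latency (low first)."""
    return sorted(bundle_rows, key=lambda row: (-row["score"], row["latency_s"]))


def choose_model(bundle_rows, passes_local_calibration, in_cooldown):
    """The bundle only supplies the order; local checks make the decision."""
    for row in rank_candidates(bundle_rows):
        model = row["model"]
        if passes_local_calibration(model) and not in_cooldown(model):
            return model
    return None  # nothing cleared locally, so no model is applied


rows = [
    {"model": "OpenAI/gpt-5.4-mini", "score": 0.991, "latency_s": 5.9},
    {"model": "Claude/claude-sonnet-4-6", "score": 1.000, "latency_s": 23.5},
    {"model": "OpenAI/gpt-5.5", "score": 0.998, "latency_s": 15.2},
]
now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
picked = choose_model(
    rows,
    passes_local_calibration=lambda m: m != "Claude/claude-sonnet-4-6",
    in_cooldown=lambda m: False,
)
print(picked, needs_refresh(now - timedelta(hours=3), now))
```

In this example the top-ranked row fails the stand-in local calibration check, so the next candidate is chosen even though the bundle ranks it lower.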

GET /api/model-benchmarks/message-pipeline
GET /api/model-benchmarks/agent-chat
GET /api/model-benchmarks/text-embedding
POST /api/model-benchmarks/message-pipeline/community

2026.06.18 expires Jun 20, 2026

Community-tested baselines by AI function.