Model benchmarks

Community-tested baselines by AI function.

SwarmMarshal publishes privacy-safe cloud and local model benchmark rows for function-specific workloads such as message enrichment and AI chat. The desktop app uses these rows as candidate priors, then applies local calibration and routing policy before changing active models.

Function charts

Recommended baseline models by function and hardware bucket.

Promotion gates are function-specific. Message pipeline rows require a 90% pass rate. Agent-chat rows require a 60% pass rate, with strict review of near misses, so good local chat candidates are still exposed.

| Function | Hardware bucket | Model | Score | Pass | Latency | Confidence | Required settings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Message pipeline | generic | Claude/claude-sonnet-4-6 | 1.000 | 100% | 23.5s | official-lab | |
| Message pipeline | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.999 | 100% | 24.3s | official-lab | keep_alive=30m think=False |
| Message pipeline | subscription-cli | ClaudeCode/claude-code:opus | 1.000 | 100% | 21.6s | official-lab | provider_kind=subscription-cli requires_subscription=True |
| AI chat assistant | generic | OpenAI/gpt-5.5 | 0.940 | 67% | 34.1s | official-lab | |
| AI chat assistant | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.940 | 67% | 50.6s | official-lab | keep_alive=30m think=False |
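The "Required settings" column lists space-separated `key=value` pairs. The following is a minimal sketch of how a client might turn one of those cells into a typed mapping. The function name `parse_required_settings` and the rule that `True`/`False` become booleans are assumptions for illustration, not the desktop app's documented behavior.

```python
def parse_required_settings(text: str) -> dict[str, object]:
    """Parse a 'Required settings' cell such as 'keep_alive=30m think=False'.

    Hypothetical helper: the page only shows the flattened key=value form,
    so the parsing rules below are an assumption.
    """
    settings: dict[str, object] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        # Literal True/False become booleans; everything else stays a string.
        if value in ("True", "False"):
            settings[key] = value == "True"
        else:
            settings[key] = value
    return settings


local_settings = parse_required_settings("keep_alive=30m think=False")
cli_settings = parse_required_settings(
    "provider_kind=subscription-cli requires_subscription=True"
)
empty_settings = parse_required_settings("")  # rows with no required settings
```

Values such as `30m` are kept as strings here, so the runtime that receives them can interpret the duration itself.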

Official rows

Full aggregate benchmark charts.

These are aggregate calibration results only. They never include raw messages, prompts, model responses, headers, extracted facts, embeddings, or personal data.

Message pipeline

Gate: score at least 0.88, pass rate at least 90%, no critical failures, and no parse failures.

| Scope | Hardware bucket | Model | Score | Pass | Latency | Gate | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cloud | generic | Claude/claude-sonnet-4-6 | 1.000 | 100% | 23.5s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.999 | 100% | 24.3s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:opus | 1.000 | 100% | 21.6s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:sonnet | 0.999 | 100% | 70.3s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.5 | 0.998 | 89% | 15.2s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-opus-4-7 | 0.996 | 89% | 17.5s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-mini | 0.991 | 89% | 5.9s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-haiku-4-5-20251001 | 0.991 | 78% | 9.9s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-flash | 0.987 | 67% | 14.5s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-nano | 0.936 | 67% | 10.6s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-pro | 0.466 | 56% | 63.6s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3:30b-a3b | 0.987 | 78% | 31.8s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3.5:35b-a3b | 0.983 | 67% | 32.4s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen2.5:14b | 0.968 | 44% | 68.7s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:haiku | 0.992 | 78% | 60.0s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | CodexCli/codex-cli:default | 0.000 | 0% | 2.1s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
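The gate rules stated on this page can be sketched as a single predicate. The thresholds (score 0.88; pass rate 90% for the message pipeline and 60% for AI chat) come from the page. The `Row` and `Gate` types, their field names, and the zero failure counts are assumptions, since the published rows do not show failure counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    min_score: float
    min_pass_rate: float  # fraction in [0, 1]


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    pass_rate: float  # fraction in [0, 1], as published
    # Assumption: failure counts are not shown in the tables, so default to 0.
    critical_failures: int = 0
    parse_failures: int = 0


# Thresholds taken from the "Gate:" lines on this page.
MESSAGE_PIPELINE_GATE = Gate(min_score=0.88, min_pass_rate=0.90)
AI_CHAT_GATE = Gate(min_score=0.88, min_pass_rate=0.60)


def clears_gate(row: Row, gate: Gate) -> bool:
    """Return True only if every gate condition holds."""
    return (
        row.score >= gate.min_score
        and row.pass_rate >= gate.min_pass_rate
        and row.critical_failures == 0
        and row.parse_failures == 0
    )


sonnet = Row("Claude/claude-sonnet-4-6", score=1.000, pass_rate=1.00)
gpt_pipeline = Row("OpenAI/gpt-5.5", score=0.998, pass_rate=0.89)
gpt_chat = Row("OpenAI/gpt-5.5", score=0.940, pass_rate=0.67)

pipeline_results = [clears_gate(r, MESSAGE_PIPELINE_GATE) for r in (sonnet, gpt_pipeline)]
chat_result = clears_gate(gpt_chat, AI_CHAT_GATE)
```

In this sketch a high score alone does not clear a gate: a row scoring 0.998 with an 89% pass rate is still held under the 90% message pipeline threshold, while a 67% pass rate clears the 60% AI chat threshold.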

AI chat assistant

Gate: score at least 0.88, pass rate at least 60%, no critical failures, and no parse failures.

| Scope | Hardware bucket | Model | Score | Pass | Latency | Gate | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cloud | generic | OpenAI/gpt-5.5 | 0.940 | 67% | 34.1s | Clears | Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss. |
| Local | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.940 | 67% | 50.6s | Clears | Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss. |

Community data

What users can safely submit.

Allowed

Hardware bucket, OS/runtime version, model tag, quantization, score, pass rate, failure counts, latency, and safe runtime settings.

Never collected

Raw emails, prompts, model responses, headers, extracted facts, contact names, or any user-derived message content.
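One way to enforce the split above on the client is an allow-list: keep only fields known to be safe and drop everything else before upload. The sketch below assumes hypothetical field names (`hardware_bucket`, `model_tag`, and so on) that mirror the "Allowed" list; the actual submission schema is not shown on this page.

```python
# Hypothetical field names mirroring the "Allowed" list; the real schema may differ.
ALLOWED_FIELDS = frozenset({
    "hardware_bucket",
    "os_version",
    "runtime_version",
    "model_tag",
    "quantization",
    "score",
    "pass_rate",
    "failure_counts",
    "latency_s",
    "runtime_settings",
})


def sanitize_submission(raw: dict) -> dict:
    """Keep only allow-listed keys; anything unrecognized is dropped."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}


candidate = {
    "hardware_bucket": "nvidia-4070-class-64gb",
    "model_tag": "qwen3:30b-a3b",
    "score": 0.987,
    "pass_rate": 0.78,
    "latency_s": 31.8,
    "prompt": "text that must never leave the device",  # not allow-listed
    "contact_names": ["placeholder"],                   # not allow-listed
}
safe = sanitize_submission(candidate)
```

An allow-list is used here rather than a block-list so that a new, unanticipated field is excluded by default instead of leaking through.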

Bundle API

Daily app update endpoint.

The desktop app checks this endpoint daily and caches the bundle locally. It only uses the bundle to rank candidates; local calibration and cooldown rules still decide whether a model is applied.

GET /api/model-benchmarks/message-pipeline
GET /api/model-benchmarks/agent-chat
GET /api/model-benchmarks/text-embedding
POST /api/model-benchmarks/message-pipeline/community

Bundle 2026.06.18, expires Jun 20, 2026
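The daily check and rank-only use of the bundle can be sketched as two small functions. Everything named here is an assumption for illustration: `needs_refresh`, `rank_candidates`, the row keys, treating an expired bundle as stale, and ordering by score and then latency. The page states only that the app checks daily, caches locally, and uses the bundle to rank candidates.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CHECK_INTERVAL = timedelta(days=1)  # "checks this endpoint daily"


def needs_refresh(last_checked: Optional[datetime],
                  expires_at: datetime,
                  now: datetime) -> bool:
    """Refetch when nothing is cached, a day has passed, or the bundle expired."""
    if last_checked is None:
        return True
    return now >= expires_at or now - last_checked >= CHECK_INTERVAL


def rank_candidates(rows: list[dict]) -> list[dict]:
    """Order rows by score (high first), then latency (low first).

    Ranking only: this never applies a model. Local calibration and
    cooldown rules are assumed to run separately afterwards.
    """
    return sorted(rows, key=lambda row: (-row["score"], row["latency_s"]))


# Assumption: expiry is taken as midnight UTC on the listed date.
expires_at = datetime(2026, 6, 20, tzinfo=timezone.utc)
last_checked = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)

ranked = rank_candidates([
    {"model": "Ollama/qwen3:30b-a3b", "score": 0.987, "latency_s": 31.8},
    {"model": "Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL",
     "score": 0.999, "latency_s": 24.3},
    {"model": "Ollama/qwen3.5:35b-a3b", "score": 0.983, "latency_s": 32.4},
])
```

Keeping ranking separate from activation mirrors the description above: the bundle proposes an order, and a different step decides whether anything changes.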