SwarmMarshal publishes privacy-safe cloud and local model benchmark rows for function-specific workloads such as message enrichment and AI chat. The desktop app uses these rows as candidate priors, then applies local calibration and routing policy before changing active models.
Promotion gates are function-specific. Message pipeline rows require a 90% pass rate; AI chat assistant rows require a 60% pass rate, which keeps good local chat candidates visible while near misses are still reviewed strictly.
| Function | Hardware bucket | Model | Score | Pass | Latency | Confidence | Required settings |
|---|---|---|---|---|---|---|---|
| Message pipeline | generic | Claude/claude-sonnet-4-6 | 1.000 | 100% | 23.5s | official-lab | |
| Message pipeline | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.999 | 100% | 24.3s | official-lab | keep_alive=30m think=False |
| Message pipeline | subscription-cli | ClaudeCode/claude-code:opus | 1.000 | 100% | 21.6s | official-lab | provider_kind=subscription-cli requires_subscription=True |
| AI chat assistant | generic | OpenAI/gpt-5.5 | 0.940 | 67% | 34.1s | official-lab | |
| AI chat assistant | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.940 | 67% | 50.6s | official-lab | keep_alive=30m think=False |
These are aggregate calibration results only. They never include raw messages, prompts, model responses, headers, extracted facts, embeddings, or personal data.
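The "Required settings" column in the table above holds space-separated `key=value` pairs such as `keep_alive=30m think=False`. A minimal sketch of turning such a cell into a dictionary follows. The function name `parse_settings` and the choice to convert `True`/`False` into booleans are assumptions made for illustration, not a description of how the desktop app reads the column.

```python
def parse_settings(cell: str) -> dict[str, object]:
    """Parse a 'key=value key=value' cell into a dict.

    'True' and 'False' become booleans; every other value stays a string.
    An empty cell yields an empty dict.
    """
    settings: dict[str, object] = {}
    for token in cell.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        if value in ("True", "False"):
            settings[key] = value == "True"
        else:
            settings[key] = value  # e.g. "30m" is kept as an opaque string
    return settings


local_row = parse_settings("keep_alive=30m think=False")
cloud_row = parse_settings("")  # rows with no required settings
```

Here `local_row` is `{"keep_alive": "30m", "think": False}` and `cloud_row` is `{}`.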
Message pipeline gate: score at least 0.88, pass rate at least 90%, no critical failures, and no parse failures.
| Scope | Hardware bucket | Model | Score | Pass | Latency | Gate | Notes |
|---|---|---|---|---|---|---|---|
| Cloud | generic | Claude/claude-sonnet-4-6 | 1.000 | 100% | 23.5s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.999 | 100% | 24.3s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:opus | 1.000 | 100% | 21.6s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:sonnet | 0.999 | 100% | 70.3s | Clears | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.5 | 0.998 | 89% | 15.2s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-opus-4-7 | 0.996 | 89% | 17.5s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-mini | 0.991 | 89% | 5.9s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-haiku-4-5-20251001 | 0.991 | 78% | 9.9s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-flash | 0.987 | 67% | 14.5s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-nano | 0.936 | 67% | 10.6s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-pro | 0.466 | 56% | 63.6s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3:30b-a3b | 0.987 | 78% | 31.8s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3.5:35b-a3b | 0.983 | 67% | 32.4s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen2.5:14b | 0.968 | 44% | 68.7s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:haiku | 0.992 | 78% | 60.0s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | CodexCli/codex-cli:default | 0.000 | 0% | 2.1s | Hold | Official seed-v4 production-pipeline calibration refreshed 2026-06-05 UTC for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
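The Gate column above can be read as the result of a simple threshold check. The following is a minimal sketch of that check for the message pipeline gate, not the project's actual implementation. The names `Gate`, `BenchmarkRow`, and `gate_status` are hypothetical, and because the table does not publish failure counts, the example rows leave them at zero purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    min_score: float
    min_pass_pct: int  # whole percent, as shown in the Pass column


@dataclass(frozen=True)
class BenchmarkRow:
    model: str
    score: float
    pass_pct: int
    critical_failures: int = 0  # assumed zero; not published in the table
    parse_failures: int = 0     # assumed zero; not published in the table


# Thresholds taken from the gate line: score >= 0.88, pass rate >= 90%.
MESSAGE_PIPELINE_GATE = Gate(min_score=0.88, min_pass_pct=90)


def gate_status(row: BenchmarkRow, gate: Gate) -> str:
    """Return 'Clears' only if every gate condition holds, else 'Hold'."""
    clears = (
        row.score >= gate.min_score
        and row.pass_pct >= gate.min_pass_pct
        and row.critical_failures == 0
        and row.parse_failures == 0
    )
    return "Clears" if clears else "Hold"


rows = [
    BenchmarkRow("Claude/claude-sonnet-4-6", 1.000, 100),
    BenchmarkRow("OpenAI/gpt-5.5", 0.998, 89),
    BenchmarkRow("DeepSeek/deepseek-v4-pro", 0.466, 56),
]
statuses = {r.model: gate_status(r, MESSAGE_PIPELINE_GATE) for r in rows}
```

Under these assumptions, `OpenAI/gpt-5.5` is held despite its 0.998 score because its 89% pass rate falls just short of the 90% threshold, which matches the table.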
AI chat assistant gate: score at least 0.88, pass rate at least 60%, no critical failures, and no parse failures.
| Scope | Hardware bucket | Model | Score | Pass | Latency | Gate | Notes |
|---|---|---|---|---|---|---|---|
| Cloud | generic | OpenAI/gpt-5.5 | 0.940 | 67% | 34.1s | Clears | Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss. |
| Local | nvidia-4070-class-64gb | Ollama/hf.co/unsloth/gemma-4-12B-it-GGUF:UD-Q4_K_XL | 0.940 | 67% | 50.6s | Clears | Official demo-profile assistant harness refreshed 2026-06-05 UTC on three source-backed scenarios: expense itemization, last-month receipts, and mail-thread intelligence. Aggregate metadata only; one strict relative-date clarification near miss. |
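With only three scenarios, the pass rate moves in steps of one third, so it helps to see how a percentage threshold translates into a count of passing cases. The sketch below assumes the Pass column is the fraction of passed cases rounded to a whole percent; that reading is consistent with the published figures but is not stated by the source. Both function names are hypothetical.

```python
def displayed_pass_pct(passed: int, case_count: int) -> int:
    """Pass rate as a whole percent (assumed rounding)."""
    return round(100 * passed / case_count)


def min_passes_required(case_count: int, min_pass_pct: int) -> int:
    """Smallest number of passing cases whose exact rate meets the threshold.

    Uses integer ceiling division to avoid floating-point edge cases.
    """
    return -(-case_count * min_pass_pct // 100)


# Three chat scenarios against the 60% gate: two passes are enough.
chat_needed = min_passes_required(3, 60)   # 2
chat_shown = displayed_pass_pct(2, 3)      # 67

# Nine seed cases against a 90% gate: eight passes (89%) are not enough.
pipeline_needed = min_passes_required(9, 90)  # 9
pipeline_near_miss = displayed_pass_pct(8, 9)  # 89
```

Read this way, the 67% rows in the table correspond to two of three scenarios passing with one near miss, which clears the 60% gate. A 90% gate on nine cases, by contrast, leaves no room for a single failed case.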
Allowed: hardware bucket, OS/runtime version, model tag, quantization, score, pass rate, failure counts, latency, and safe runtime settings.
Never included: raw emails, prompts, model responses, headers, extracted facts, contact names, or any user-derived message content.
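One way to enforce the two lists above is an allowlist: a row keeps only the named aggregate fields, and anything else is dropped before publication. This is a minimal sketch under that assumption. The field names and the function `to_public_row` are hypothetical and are not taken from the SwarmMarshal bundle format.

```python
# Hypothetical field names mirroring the allowed list above.
ALLOWED_FIELDS = frozenset({
    "hardware_bucket",
    "os_runtime_version",
    "model_tag",
    "quantization",
    "score",
    "pass_rate",
    "failure_counts",
    "latency_s",
    "runtime_settings",
})


def to_public_row(record: dict) -> dict:
    """Keep only allowlisted aggregate fields; drop everything else."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


local_record = {
    "hardware_bucket": "nvidia-4070-class-64gb",
    "model_tag": "Ollama/qwen3:30b-a3b",
    "score": 0.987,
    "pass_rate": 0.78,
    "latency_s": 31.8,
    "prompt": "text that must never leave the device",  # not allowlisted
}
public_row = to_public_row(local_record)
```

An allowlist fails closed: a new field added to the local record stays private until someone deliberately adds it to `ALLOWED_FIELDS`, whereas a blocklist would publish it by default.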
The desktop app checks this endpoint daily and caches the bundle locally. It only uses the bundle to rank candidates; local calibration and cooldown rules still decide whether a model is applied.