Running OpenClaw: A Cost Engineering Analysis of LLM Inference Providers (2026)
A quantitative analysis of running OpenClaw across 10 inference providers — including Kimi K2.5, GLM-5, and M3 Ultra Mac Studio hardware benchmarks. Hard math on cost-per-token, throughput benchmarks, smart routing savings, and projected monthly expenses with inline charts.
Disclosure: This is an automated research report generated by Claude (Anthropic) on February 12, 2026 (updated February 12, 2026). It was commissioned by Optimal as part of internal infrastructure research for deploying autonomous AI agents. Nothing in this report constitutes financial advice. All pricing data sourced from provider documentation and third-party benchmarks as of the publication date.
Executive Summary
OpenClaw is not a model — it is an open-source AI agent orchestration platform (180K+ GitHub stars, MIT license) that connects any LLM to messaging channels (WhatsApp, Telegram, Discord, Slack, iMessage) with autonomous tool execution, persistent memory, and scheduling [1].
The critical cost decision is not OpenClaw itself (free), but which LLM backend to power it. This report benchmarks 10 inference providers and 3 new frontier models — including Kimi K2.5 (Moonshot AI) and GLM-5 (Zhipu AI) — across price, speed, and reliability. We also evaluate the M3 Ultra Mac Studio as a local inference alternative.
Key finding: A well-configured OpenClaw deployment costs $5–30/month for regular use. The new Chinese open-weight models (Kimi K2.5 at $0.60/M input, GLM-5 at $1.00/M input) deliver frontier-class performance at one-fifth to one-eighth the price of Claude Opus or GPT-5. Smart routing via ClawRouter reduces costs by 70–78% [2].
| Deployment Profile | Monthly Cost | Model Strategy |
|---|---|---|
| Hobby (10–50 msgs/day) | $0–10 | Ollama local or free-tier APIs |
| Regular (50–200 msgs/day) | $15–30 | DeepSeek V3 + Groq fallback |
| Power (200–500 msgs/day) | $40–100 | ClawRouter multi-model + Kimi K2.5 |
| Enterprise (500+ msgs/day) | $100–800+ | Claude/GPT-5 + smart routing |
| Local hardware (M3 Ultra) | $207/mo amortized | Privacy-first, offline, 671B models |
Part 1: The Provider Landscape (2026 Update)
Ten providers were evaluated across speed, cost, and reliability. The benchmark uses GPT-OSS-120B (open-weights, available cross-provider) for apples-to-apples comparison [3].
Head-to-Head Benchmark: Same Model, Different Providers
| Provider | Speed (tok/s) | TTFT | Input $/1M | Output $/1M | Reliability |
|---|---|---|---|---|---|
| Cerebras | 2,988 | 0.26s | $0.35 | $0.75 | 95%+ |
| Together AI | 917 | 0.78s | $0.15 | $0.60 | 95%+ |
| Fireworks AI | 747 | 0.17s | $0.15 | $0.60 | 95%+ |
| Groq | 456 | 0.19s | $0.15 | $0.60 | 95%+ |
| Baseten | 341 | 0.73s | — | — | 95%+ |
| Clarifai | 313 | 0.27s | $0.09 | $0.09 | 95%+ |
| DeepInfra | 79–258 | 0.23–1.27s | $0.08 | $0.30 | 68–70% |
Takeaway: Cerebras is 3x faster than the next competitor. Fireworks has the lowest latency (0.17s TTFT). DeepInfra is cheapest but unreliable — avoid for production [3].
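The two headline metrics in the table, TTFT and tokens/second, fall out of any streaming response. A minimal sketch of the measurement logic (this is an illustration, not the benchmark's actual harness; the word-split token count is a crude proxy for real tokenizer counts):

```python
import time

def benchmark_stream(chunks, clock=time.perf_counter):
    """Measure time-to-first-token (TTFT) and generation speed over an
    iterable of streamed text chunks (e.g. an SSE response body).
    `clock` is injectable so the logic can be tested deterministically."""
    start = clock()
    ttft = None
    tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start      # first chunk arrived
        tokens += len(chunk.split())    # crude token proxy; real runs read the API's usage counts
    total = clock() - start
    return {"ttft_s": ttft, "tok_per_s": tokens / (total - ttft)}
```

Generation speed is computed over the post-TTFT window, which is why providers with similar tok/s can still differ sharply in perceived latency.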
Frontier Model Pricing (Per 1M Tokens) — Updated Feb 2026
| Model | Input $/1M | Output $/1M | Context | Open Source? |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.25 | $0.38 | 163K | Yes |
| Kimi K2.5 | $0.60 | $3.00 | 256K | Yes (MIT) |
| GLM-5 | $1.00 | $3.20 | 200K | Yes (MIT) |
| Gemini 3 Flash | $0.50 | $3.00 | 1M | No |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200K | No |
| GPT-5.3 Codex | $3.00 | $12.00 | 256K | No |
| Claude Opus 4.6 | $5.00 | $25.00 | 200K | No |
Part 2: New Contenders — Kimi K2.5 & GLM-5
Kimi K2.5 (Moonshot AI) — Released January 27, 2026
A 1 trillion parameter MoE model (32B active per token, 384 experts, MIT license) with native vision and a 256K context window. Available on Hugging Face and all major providers [9].
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | GPT-5.2 | Llama 3.3 70B |
|---|---|---|---|---|
| MMLU-Pro | 87.1 | ~87.5 | 87.1 | ~80 |
| SWE-Bench Verified | 76.8 | 77.2–82.0 | — | ~45–50 |
| GPQA-Diamond | 87.6 | ~86 | — | ~60 |
| BrowseComp | 60.6–78.4 | 24.1 | 54.9 | N/A |
| HLE-Full (w/ tools) | 50.2 | ~40 | ~45 | N/A |
Pricing across providers:
| Provider | Input $/1M | Output $/1M | Speed (tok/s) |
|---|---|---|---|
| Moonshot (official) | $0.60 | $3.00 | 37 |
| OpenRouter (DeepInfra) | $0.45 | $2.25 | — |
| Fireworks | ~$1.07 blended | — | 219 |
| Together AI | ~$1.07 blended | — | 56 |
| Baseten | — | — | 341 |
Community verdict: "Right up there with Sonnet 4.5" for CRUD web apps. Wins massively on agentic search (BrowseComp). Caveat: K2.5 is ~3x more verbose than Opus — the effective cost savings are closer to 3x, not 9x [9].
GLM-5 (Zhipu AI / Z.ai) — Released February 11, 2026
A 744B MoE model (40–44B active, 256 experts, MIT license) trained entirely on Huawei Ascend chips — zero NVIDIA dependency. Released alongside SLIME, an open-source async RL training framework [10].
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 | Kimi K2.5 |
|---|---|---|---|---|
| SWE-Bench Verified | 77.8 | 80.9 | 80.0 | 76.8 |
| BrowseComp | 75.9 | 67.8 | 65.8 | 60.6 |
| HLE-Full (w/ tools) | 50.4 | 43.4 | 45.5 | 50.2 |
| Terminal-Bench 2.0 | 56.2 | 59.3 | 54.0 | 50.8 |
| AIME 2026 I | 92.7 | — | — | 96.1 |
| Hallucination (AA) | Record low | — | — | — |
Z.ai pricing tier (complete):
| Model | Input $/1M | Output $/1M | Notes |
|---|---|---|---|
| GLM-5 (flagship) | $1.00 | $3.20 | New SOTA open model |
| GLM-4.7 | $0.60 | $2.20 | Previous flagship |
| GLM-4.7-FlashX | $0.07 | $0.40 | Budget powerhouse |
| GLM-4.7-Flash | Free | Free | Rate-limited, 200K ctx |
| GLM-4.5-Flash | Free | Free | Rate-limited |
Western access: Available day-1 on OpenRouter ($0.80/$2.56 via AtlasCloud), DeepInfra, and Vercel AI Gateway. No VPN needed.
Compliance note: Z.ai remains on the U.S. Commerce Department Entity List (since Jan 2025). Use GLM models via western providers (OpenRouter, Fireworks) to mitigate regulatory risk [6].
The Chinese Open-Weight Value Play
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 260" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="260" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Benchmark Score vs Output Price (SWE-Bench Verified)</text> <text x="310" y="46" text-anchor="middle" fill="#64748b" font-size="10">Higher = better code. Leftward = cheaper. Best position = top-left.</text> <line x1="70" y1="60" x2="70" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <line x1="70" y1="210" x2="590" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <text x="66" y="74" text-anchor="end" fill="#64748b" font-size="9">82%</text> <text x="66" y="114" text-anchor="end" fill="#64748b" font-size="9">78%</text> <text x="66" y="154" text-anchor="end" fill="#64748b" font-size="9">74%</text> <text x="66" y="194" text-anchor="end" fill="#64748b" font-size="9">70%</text> <line x1="70" y1="70" x2="590" y2="70" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="110" x2="590" y2="110" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="150" x2="590" y2="150" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="190" x2="590" y2="190" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <text x="90" y="226" fill="#64748b" font-size="9">$0.38</text> <text x="170" y="226" fill="#64748b" font-size="9">$3.00</text> <text x="280" y="226" fill="#64748b" font-size="9">$12</text> <text x="380" y="226" fill="#64748b" font-size="9">$15</text> <text x="530" y="226" fill="#64748b" font-size="9">$25</text> <text x="330" y="246" text-anchor="middle" fill="#64748b" font-size="9">Output cost per 1M tokens →</text> <circle cx="90" cy="194" r="8" fill="#10b981" opacity="0.8"/> <text x="90" y="188" text-anchor="middle" fill="#6ee7b7" 
font-size="8">DSv3.2</text> <text x="90" y="204" text-anchor="middle" fill="#6ee7b7" font-size="7">70.2%</text> <circle cx="168" cy="102" r="10" fill="#06b6d4" opacity="0.8"/> <text x="168" y="96" text-anchor="middle" fill="#67e8f9" font-size="8">K2.5</text> <text x="168" y="112" text-anchor="middle" fill="#67e8f9" font-size="7">76.8%</text> <circle cx="182" cy="90" r="10" fill="#8b5cf6" opacity="0.8"/> <text x="182" y="84" text-anchor="middle" fill="#c4b5fd" font-size="8">GLM-5</text> <text x="182" y="100" text-anchor="middle" fill="#c4b5fd" font-size="7">77.8%</text> <circle cx="290" cy="78" r="9" fill="#f97316" opacity="0.8"/> <text x="290" y="72" text-anchor="middle" fill="#fdba74" font-size="8">GPT-5.2</text> <text x="290" y="88" text-anchor="middle" fill="#fdba74" font-size="7">80.0%</text> <circle cx="390" cy="82" r="9" fill="#ec4899" opacity="0.8"/> <text x="390" y="76" text-anchor="middle" fill="#f9a8d4" font-size="8">Sonnet</text> <text x="390" y="92" text-anchor="middle" fill="#f9a8d4" font-size="7">77.2%</text> <circle cx="540" cy="70" r="9" fill="#ef4444" opacity="0.8"/> <text x="540" y="64" text-anchor="middle" fill="#fca5a5" font-size="8">Opus</text> <text x="540" y="80" text-anchor="middle" fill="#fca5a5" font-size="7">80.9%</text> </svg> </div>

The bottom line: Kimi K2.5 and GLM-5 are within 3–4 points of Claude Opus on SWE-Bench Verified — at $3.00–3.20/M output vs $25.00/M output. That's a 7–8x cost reduction for ~96% of the coding capability. Both are fully MIT-licensed with open weights.
Part 3: The Math — Cost Modeling
Token Economics Primer
A typical OpenClaw conversation turn consumes:
- System prompt + memory context: ~2,000 tokens (input)
- User message: ~100 tokens (input)
- Tool calls + results: ~500 tokens (input/output)
- Agent response: ~300 tokens (output)
Per-turn total: ~2,600 input + ~800 output tokens
Cost Per Turn by Provider (Including New Models)
| Provider | Model | Cost/Turn | 100 turns/day | 30-day cost |
|---|---|---|---|---|
| Z.ai | GLM-4.7-Flash | $0.000000 | $0.000 | $0.00 |
| Together AI | Gemma 3n E4B | $0.000084 | $0.008 | $0.25 |
| Groq | Llama 3.1 8B | $0.000194 | $0.019 | $0.58 |
| Cerebras | Llama 3.1 8B | $0.000340 | $0.034 | $1.02 |
| Groq | GPT-OSS-120B | $0.000870 | $0.087 | $2.61 |
| OpenRouter | DeepSeek V3.2 | $0.000954 | $0.095 | $2.87 |
| Groq | Llama 3.3 70B | $0.002166 | $0.217 | $6.50 |
| OpenRouter | Kimi K2.5 | $0.002970 | $0.297 | $8.91 |
| Z.ai | GLM-5 | $0.005160 | $0.516 | $15.48 |
| Direct | Claude Sonnet 4.5 | $0.019800 | $1.980 | $59.40 |
| Direct | Claude Opus 4.6 | $0.033000 | $3.300 | $99.00 |
Formula: Cost/turn = (input_tokens × input_price/1M) + (output_tokens × output_price/1M)
Kimi K2.5 (OpenRouter): (2,600 × $0.45/1M) + (800 × $2.25/1M) = $0.001170 + $0.001800 = $0.002970
GLM-5 (Z.ai): (2,600 × $1.00/1M) + (800 × $3.20/1M) = $0.002600 + $0.002560 = $0.005160
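The formula above is a one-liner in code. A quick sketch that reproduces the Kimi K2.5 and GLM-5 worked examples (prices are $ per 1M tokens, taken from the pricing tables above):

```python
def cost_per_turn(input_tokens, output_tokens, input_price, output_price):
    """Cost of one conversation turn; prices are in $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Typical OpenClaw turn: ~2,600 input + ~800 output tokens
kimi = cost_per_turn(2_600, 800, 0.45, 2.25)  # Kimi K2.5 via OpenRouter
glm5 = cost_per_turn(2_600, 800, 1.00, 3.20)  # GLM-5 via Z.ai
print(f"Kimi K2.5: ${kimi:.6f}/turn, GLM-5: ${glm5:.6f}/turn")
```

Plugging in any row from the frontier pricing table gives the corresponding cost-per-turn entry.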
Monthly Cost Projection — Visual Scaling
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 340" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="340" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Monthly Cost by Daily Message Volume</text> <text x="50" y="62" text-anchor="end" fill="#94a3b8" font-size="10">$600</text> <text x="50" y="102" text-anchor="end" fill="#94a3b8" font-size="10">$450</text> <text x="50" y="142" text-anchor="end" fill="#94a3b8" font-size="10">$300</text> <text x="50" y="182" text-anchor="end" fill="#94a3b8" font-size="10">$150</text> <text x="50" y="222" text-anchor="end" fill="#94a3b8" font-size="10">$50</text> <text x="50" y="262" text-anchor="end" fill="#94a3b8" font-size="10">$0</text> <line x1="60" y1="58" x2="580" y2="58" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="98" x2="580" y2="98" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="138" x2="580" y2="138" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="178" x2="580" y2="178" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="218" x2="580" y2="218" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="258" x2="580" y2="258" stroke="#94a3b8" stroke-width="0.5"/> <text x="100" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">25/day</text> <text x="204" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">100/day</text> <text x="308" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">250/day</text> <text x="412" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">500/day</text> <text x="516" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">1000/day</text> <polyline points="100,257 204,257 308,256 412,255 516,254" fill="none" stroke="#10b981" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,255 204,249 308,236 412,214 516,170" fill="none" 
stroke="#06b6d4" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,253 204,241 308,215 412,172 516,86" fill="none" stroke="#8b5cf6" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,243 204,218 308,178 412,98 516,58" fill="none" stroke="#ec4899" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,225 204,178 308,98 412,58 516,58" fill="none" stroke="#ef4444" stroke-width="2.5" stroke-linecap="round" stroke-dasharray="6,3"/> <circle cx="100" cy="257" r="3" fill="#10b981"/><circle cx="204" cy="257" r="3" fill="#10b981"/><circle cx="308" cy="256" r="3" fill="#10b981"/><circle cx="412" cy="255" r="3" fill="#10b981"/><circle cx="516" cy="254" r="3" fill="#10b981"/> <circle cx="100" cy="255" r="3" fill="#06b6d4"/><circle cx="204" cy="249" r="3" fill="#06b6d4"/><circle cx="308" cy="236" r="3" fill="#06b6d4"/><circle cx="412" cy="214" r="3" fill="#06b6d4"/><circle cx="516" cy="170" r="3" fill="#06b6d4"/> <circle cx="100" cy="253" r="3" fill="#8b5cf6"/><circle cx="204" cy="241" r="3" fill="#8b5cf6"/><circle cx="308" cy="215" r="3" fill="#8b5cf6"/><circle cx="412" cy="172" r="3" fill="#8b5cf6"/><circle cx="516" cy="86" r="3" fill="#8b5cf6"/> <circle cx="100" cy="243" r="3" fill="#ec4899"/><circle cx="204" cy="218" r="3" fill="#ec4899"/><circle cx="308" cy="178" r="3" fill="#ec4899"/><circle cx="412" cy="98" r="3" fill="#ec4899"/> <rect x="90" y="290" width="12" height="3" rx="1" fill="#10b981"/><text x="108" y="294" fill="#94a3b8" font-size="9">Groq 8B ($0.58)</text> <rect x="195" y="290" width="12" height="3" rx="1" fill="#06b6d4"/><text x="213" y="294" fill="#94a3b8" font-size="9">Kimi K2.5 ($8.91)</text> <rect x="310" y="290" width="12" height="3" rx="1" fill="#8b5cf6"/><text x="328" y="294" fill="#94a3b8" font-size="9">GLM-5 ($15.48)</text> <rect x="405" y="290" width="12" height="3" rx="1" fill="#ec4899"/><text x="423" y="294" fill="#94a3b8" font-size="9">Sonnet ($59)</text> <rect x="495" y="290" width="12" height="3" rx="1" 
fill="#ef4444"/><text x="513" y="294" fill="#94a3b8" font-size="9">Opus ($99)</text> <text x="310" y="332" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Monthly cost at 100 msgs/day reference point. Opus/Sonnet scale exceeds chart at high volumes.</text> </svg> </div>

| Daily Messages | Groq 8B | Kimi K2.5 | GLM-5 | GLM-4.7-Flash | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|---|---|---|
| 25 | $0.15 | $2.23 | $3.87 | $0.00 | $14.85 | $24.75 |
| 100 | $0.58 | $8.91 | $15.48 | $0.00 | $59.40 | $99.00 |
| 250 | $1.46 | $22.28 | $38.70 | $0.00 | $148.50 | $247.50 |
| 500 | $2.91 | $44.55 | $77.40 | $0.00 | $297.00 | $495.00 |
| 1,000 | $5.82 | $89.10 | $154.80 | $0.00 | $594.00 | $990.00 |
The standout: GLM-4.7-Flash is genuinely free (rate-limited) with 200K context. For agent prototyping and high-volume tool-calling, this is unbeatable at $0.
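The volume table above is straight multiplication of per-turn cost by daily volume; a quick sanity-check sketch (per-turn figures come from the cost-per-turn table earlier in this part):

```python
def monthly_cost(cost_per_turn, msgs_per_day, days=30):
    """Project a monthly bill from per-turn cost and daily message volume."""
    return cost_per_turn * msgs_per_day * days

for name, per_turn in [("Kimi K2.5", 0.00297), ("GLM-5", 0.00516), ("Opus 4.6", 0.033)]:
    print(f"{name} at 100 msgs/day: ${monthly_cost(per_turn, 100):.2f}/month")
```

Costs scale linearly with volume, which is why model choice dominates every other knob at high message counts.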
Part 4: Smart Routing — The 78% Cost Reduction
The Problem
A naive OpenClaw config sends every request — including "what time is it?" and "check my calendar" — to the same model. If that model is Claude Opus at $25/1M output tokens, you're paying frontier prices for trivial tasks.
ClawRouter Architecture
ClawRouter analyzes each prompt locally (<1ms, zero API calls) using a 14-dimension scoring system and routes to the cheapest capable model [2].
| Tier | % of Requests | Route To | Avg Cost/Turn | Weighted Cost |
|---|---|---|---|---|
| SIMPLE (45%) | Status, math, time | GLM-4.7-Flash | $0.0000 | $0.000000 |
| MODERATE (30%) | Summarize, translate | Kimi K2.5 | $0.0030 | $0.000900 |
| COMPLEX (20%) | Coding, reasoning | GLM-5 or Sonnet | $0.0052 | $0.001040 |
| EXPERT (5%) | Architecture, research | Claude Opus 4.6 | $0.0330 | $0.001650 |
| Blended | 100% | — | — | $0.003590 |
- Without routing (all Claude Opus): $0.033/turn
- With routing (blended, using new models): $0.0036/turn
- Savings: 89.1% (up from 81.7% before Kimi K2.5 and the GLM free tier)
One developer reported their Anthropic bill dropped from $4,660/month to ~$1,400/month using ClawRouter — a 70% reduction — because ~60% of their agent's requests were simple enough for budget models [2].
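ClawRouter's actual 14-dimension scorer is more involved, but the economics follow from tier shares alone. A toy illustration of tier routing and the blended-cost arithmetic above (the keyword heuristic, tier shares, and model assignments are simplified assumptions drawn from the table, not ClawRouter's real logic):

```python
# Tiers modeled on the table above: (share of traffic, model, $/turn)
TIERS = {
    "SIMPLE":   (0.45, "glm-4.7-flash",   0.0000),
    "MODERATE": (0.30, "kimi-k2.5",       0.0030),
    "COMPLEX":  (0.20, "glm-5",           0.0052),
    "EXPERT":   (0.05, "claude-opus-4.6", 0.0330),
}

def route(prompt: str) -> str:
    """Toy local classifier: crude keyword heuristics standing in for
    ClawRouter's multi-dimension scoring."""
    p = prompt.lower()
    if any(k in p for k in ("architecture", "research", "design a system")):
        return "EXPERT"
    if any(k in p for k in ("code", "debug", "prove", "refactor")):
        return "COMPLEX"
    if any(k in p for k in ("summarize", "translate", "rewrite")):
        return "MODERATE"
    return "SIMPLE"

blended = sum(share * cost for share, _, cost in TIERS.values())
savings = 1 - blended / 0.033  # vs. sending everything to Claude Opus
print(f"blended ${blended:.6f}/turn, savings {savings:.1%}")
```

The point of local (client-side) routing is that this classification costs zero API calls and sub-millisecond latency, so the savings are pure.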
OpenRouter Alternative
OpenRouter offers similar capability through its Auto Model feature [4]:
| Feature | OpenRouter | ClawRouter |
|---|---|---|
| Routing location | Server-side | Local (<1ms) |
| BYOK support | Yes (1M free req/month) | N/A |
| Fee structure | 0% markup + 5% BYOK after 1M | Free (open-source) |
| Model access | 400+ models (incl. Kimi K2.5, GLM-5) | Configure your own |
| Failover | Automatic (50+ providers) | Manual config |
Part 5: Local Hardware — M3 Ultra Mac Studio
Why Consider Local Inference?
Cloud APIs win on per-token cost for most workloads. But local hardware makes sense when: (1) privacy/data sovereignty is non-negotiable, (2) you need offline access, (3) you want to run massive 400B+ models, or (4) you're deploying fine-tuned models not available via API.
M3 Ultra Mac Studio Specifications
| Spec | M3 Ultra (32C/80G) |
|---|---|
| CPU | 32-core (24P + 8E) |
| GPU | 80-core |
| Max Unified Memory | 512GB LPDDR5x |
| Memory Bandwidth | 819 GB/s |
| TDP | ~100W under load |
| Price (256GB) | $6,999 |
| Price (512GB) | $9,499 |
Key advantage: 512GB of unified memory in a single box — enough to run DeepSeek V3 671B quantized. An equivalent GPU setup (8x H100) would cost $200K+.
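Whether a model fits in unified memory reduces to a back-of-envelope estimate: quantized weights take roughly params × bits / 8, plus headroom for the KV cache and runtime. A rough estimator (the 20% overhead factor is an assumption for illustration, not a measured figure):

```python
def est_ram_gb(params_b: float, bits: int, overhead: float = 0.20) -> float:
    """Rough RAM estimate in GB: quantized weight size plus a fudge factor
    for KV cache and runtime overhead (overhead=0.20 is an assumption)."""
    weights_gb = params_b * bits / 8  # e.g. 671B at 4-bit ~= 335 GB of weights
    return weights_gb * (1 + overhead)

print(round(est_ram_gb(671, 4)))  # DeepSeek V3 671B, 4-bit: in the ballpark of the table's ~405GB
print(round(est_ram_gb(8, 4)))   # Llama 3.1 8B, 4-bit: matches the table's ~5GB
```

This also shows why 512GB is the magic number: a 4-bit 671B model clears it with room for context, while anything smaller than ~400GB of RAM cannot hold it at all.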
Tokens/Second Benchmarks (MLX Framework)
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="380" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">M3 Ultra 512GB: Generation Speed by Model (MLX)</text> <text x="130" y="62" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 1B (Q4)</text> <rect x="140" y="49" width="395" height="20" rx="3" fill="#10b981" opacity="0.85"/> <text x="545" y="63" fill="#6ee7b7" font-size="10" font-weight="600">237 t/s</text> <text x="130" y="90" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 4B (Q4)</text> <rect x="140" y="77" width="223" height="20" rx="3" fill="#10b981" opacity="0.75"/> <text x="373" y="91" fill="#6ee7b7" font-size="10" font-weight="600">134 t/s</text> <text x="130" y="118" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.1 8B (4-bit)</text> <rect x="140" y="105" width="217" height="20" rx="3" fill="#3b82f6" opacity="0.85"/> <text x="367" y="119" fill="#93c5fd" font-size="10" font-weight="600">130 t/s</text> <text x="130" y="146" text-anchor="end" fill="#e2e8f0" font-size="10">QwQ 32B (4-bit)</text> <rect x="140" y="133" width="58" height="20" rx="3" fill="#f59e0b" opacity="0.85"/> <text x="208" y="147" fill="#fcd34d" font-size="10" font-weight="600">35 t/s</text> <text x="130" y="174" text-anchor="end" fill="#e2e8f0" font-size="10">Qwen3 235B (FP8)</text> <rect x="140" y="161" width="50" height="20" rx="3" fill="#8b5cf6" opacity="0.85"/> <text x="200" y="175" fill="#c4b5fd" font-size="10" font-weight="600">30 t/s</text> <text x="130" y="202" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek V3 671B</text> <rect x="140" y="189" width="35" height="20" rx="3" fill="#ec4899" opacity="0.85"/> <text x="185" y="203" fill="#f9a8d4" font-size="10" font-weight="600">21 t/s</text> <text x="130" 
y="230" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek R1 671B</text> <rect x="140" y="217" width="33" height="20" rx="3" fill="#ec4899" opacity="0.75"/> <text x="183" y="231" fill="#f9a8d4" font-size="10" font-weight="600">20 t/s</text> <text x="130" y="258" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.3 70B (Q4)</text> <rect x="140" y="245" width="28" height="20" rx="3" fill="#3b82f6" opacity="0.75"/> <text x="178" y="259" fill="#93c5fd" font-size="10" font-weight="600">17 t/s</text> <text x="130" y="286" text-anchor="end" fill="#e2e8f0" font-size="10">GLM-4.7 358B (Q3)</text> <rect x="140" y="273" width="25" height="20" rx="3" fill="#8b5cf6" opacity="0.75"/> <text x="175" y="287" fill="#c4b5fd" font-size="10" font-weight="600">15 t/s</text> <text x="310" y="318" text-anchor="middle" fill="#94a3b8" font-size="10">MoE models (DeepSeek, Qwen3 235B) outperform their size class</text> <text x="310" y="334" text-anchor="middle" fill="#94a3b8" font-size="10">because only active experts (~37B) need memory bandwidth per token</text> <text x="310" y="360" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Sources: Hardware Corner, Lattice, Creative Strategies, MacStories — Feb 2026</text> </svg> </div>

| Model | Size | Quant | tok/s (gen) | RAM Needed |
|---|---|---|---|---|
| Gemma 3 1B | 1B | Q4 | 237 | <4GB |
| Llama 3.1 8B | 8B | 4-bit | 130 | ~5GB |
| QwQ 32B | 32B | 4-bit | 35 | ~20GB |
| Qwen3 235B (MoE) | 235B | FP8 | 30 | ~256GB |
| DeepSeek V3 (MoE) | 671B | 4-bit | 21 | ~405GB |
| DeepSeek R1 (MoE) | 671B | 4-bit | 20 | ~405GB |
| Llama 3.3 70B | 70B | Q4_K_M | 17 | ~40GB |
| GLM-4.7 358B (MoE) | 358B | Q3 | ~15 | ~256GB |
Critical context: DeepSeek V3's generation speed drops from 21 tok/s with a 69-token prompt to 5.8 tok/s at 16K tokens of context, because the growing KV cache competes with the weights for memory bandwidth. Plan for 10–15 tok/s at realistic conversation lengths.
Framework Choice Matters
| Framework | Speed vs MLX | Best For |
|---|---|---|
| MLX (Apple native) | Baseline (fastest) | Maximum Apple Silicon performance |
| LM Studio (MLX backend) | ~Same | GUI + ease of use |
| llama.cpp | 20–50% slower | Cross-platform, broader model support |
| Ollama | 20–40% slower | Easiest setup, REST API |
Use MLX for best performance. For DeepSeek V3 671B: MLX achieved 21 tok/s vs llama.cpp's 6.2 tok/s — a 3.4x difference [11].
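For the "easiest setup" path, Ollama exposes a local REST endpoint that OpenClaw (or any client) can call. A minimal stdlib-only sketch against Ollama's default local endpoint (the model name is an example; this assumes an Ollama server is running):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the model's text response."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.1:8b", "Say hi in five words.")  # requires a running Ollama server
```

Swapping the backend framework changes only this transport layer; the OpenClaw configuration on top stays the same.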
Cloud API vs. Local Hardware Break-Even
| Usage Pattern | M3 Ultra 512GB ($/M tokens) | Cheapest Cloud (DeepInfra) | Winner |
|---|---|---|---|
| Light (2M tok/day) | $3.45/M | $0.30/M | Cloud wins 11x |
| Medium (10M tok/day) | $0.69/M | $0.30/M | Cloud wins 2.3x |
| Heavy 24/7 | $5.30/M | $0.30/M | Cloud wins |
| Privacy/offline | Priceless | N/A | Local wins |
| Run 671B models | $0.69–5.30/M | $3.00–25.00/M (API) | Local wins |
The uncomfortable truth: At current cloud API pricing, local inference rarely breaks even on pure cost. Cloud providers benefit from massive batch sizes, FP8 hardware optimization, and economies of scale. The M3 Ultra's value proposition is the 512GB unified memory pool — running models that would require $200K in GPU hardware otherwise.
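The amortized figures in the break-even table reduce to simple division. A sketch using the $207/month amortized figure from the summary table (that figure is assumed to fold in electricity):

```python
def local_cost_per_m(amortized_monthly_usd: float, tokens_per_day_m: float, days: int = 30) -> float:
    """Amortized local inference cost in $ per 1M tokens."""
    return amortized_monthly_usd / (tokens_per_day_m * days)

for vol in (2, 10):  # light and medium usage patterns from the table
    print(f"{vol}M tok/day -> ${local_cost_per_m(207, vol):.2f}/M")
```

The asymmetry is visible immediately: local cost per token falls with utilization, while cloud cost per token is flat, so local only approaches parity near saturation.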
M5 Ultra (expected late 2026): ~1,100 GB/s bandwidth (+34%), ~768GB max RAM, 3–4x faster prompt processing. Worth waiting for if not urgent [12].
Part 6: Security Cost — The Hidden Variable
OpenClaw's permissionless architecture introduces non-trivial security costs [8]:
| Incident | Date | Impact |
|---|---|---|
| 40,000+ exposed instances | Jan 2026 | 93.4% had auth bypass flaws |
| Moltbook data leak | Jan 2026 | 1.5M API tokens, 35K emails exposed |
| Scam token via hijacked agent | Jan 2026 | $16M in losses |
| Prompt injection backdoors | Ongoing | Agents execute attacker instructions |
Mitigation cost considerations:
| Security Measure | Implementation | Cost Impact |
|---|---|---|
| Authentication (required) | Gateway token auth | +$0 (config change) |
| Docker sandboxing | Per-session containers | +10–20% memory overhead |
| Prompt injection defense | Latest-gen models only | +$10–50/month (use Claude over Llama) |
| NanoClaw | Hardened fork | $0 (open-source) |
Conclusion & Updated Recommendation
Cost-Optimized OpenClaw Stack for Optimal (February 2026)
| Layer | Choice | Monthly Cost |
|---|---|---|
| Hosting | Hetzner CX22 VPS | $4.15 |
| Agent Platform | OpenClaw (MIT license) | $0 |
| LLM Router | OpenRouter (BYOK, 1M free req/mo) | $0 |
| Simple tasks (45%) | GLM-4.7-Flash (free tier) | $0 |
| Moderate tasks (30%) | Kimi K2.5 via OpenRouter | ~$5–10 |
| Complex tasks (20%) | GLM-5 or Claude Sonnet 4.5 | ~$5–15 |
| Expert tasks (5%) | Claude Opus 4.6 via BYOK | ~$3–8 |
| Security | NanoClaw fork + Docker sandboxing | $0 |
| Total projected | — | $17–37/month |
Decision Matrix (Updated)
| If you need... | Choose... | Why |
|---|---|---|
| Fastest inference | Cerebras | 2,988 tok/s, 3x faster than #2 |
| Lowest latency | Fireworks AI | 0.17s TTFT |
| Best free tier | GLM-4.7-Flash | Free, 200K context, genuinely capable |
| Best open-weight frontier | GLM-5 or Kimi K2.5 | MIT license, 96% of Opus quality, 7–8x cheaper |
| Best agentic model | Kimi K2.5 (Swarm mode) | BrowseComp 78.4%, HLE-tools 50.2% |
| Best cost/performance | DeepSeek V3.2 | $0.38/M output, solid for moderate tasks |
| Maximum cost control | ClawRouter + multi-model | 89% savings via smart routing |
| Run 671B models locally | M3 Ultra 512GB | Only $9.5K box that fits DeepSeek V3 |
| Zero spend | Oracle Free + Ollama | $0/month, limited capability |
| Regulatory safety | Avoid Z.ai direct | Use GLM models via OpenRouter/Fireworks |
This report was generated on February 12, 2026 using parallel AI research agents (Claude, Anthropic). Updated with Kimi K2.5, GLM-5, and M3 Ultra Mac Studio benchmarks on the same date. All claims are hyperlinked to their sources. This is not financial advice.
Automated report produced for Optimal | Technology Category
Sources & References
<a id="ref-1"></a>[1] OpenClaw Official Documentation — Architecture, deployment guides, and hardware requirements.
<a id="ref-2"></a>[2] ClawRouter — Smart LLM Router — Open-source routing layer claiming 78% cost savings. See also: ClawRouter: How I Cut My $4,660 Bill by 70% (Medium).
<a id="ref-3"></a>[3] Open Source AI API Providers: Speed, Cost & Performance Compared (2026) — Independent benchmark of GPT-OSS-120B across 6 providers.
<a id="ref-4"></a>[4] OpenRouter Pricing & BYOK Documentation — 400+ models, 1M free BYOK requests/month. See also: OpenRouter BYOK Announcement.
<a id="ref-5"></a>[5] OpenClaw Deploy Cost Guide by WenHao Yu — Comprehensive hosting cost analysis ($0–8/month configurations).
<a id="ref-6"></a>[6] Z.ai (Zhipu AI) Wikipedia — Company background, HKEX listing, U.S. Entity List status.
<a id="ref-7"></a>[7] Contemplating Local LLMs vs OpenRouter and Z.ai — First-hand speed testing (20–30 tok/s on Z.ai direct).
<a id="ref-8"></a>[8] CrowdStrike: What Security Teams Need to Know About OpenClaw — Security analysis. See also: Infosecurity Magazine, Cisco Blog, Trend Micro.
<a id="ref-9"></a>[9] Kimi K2.5 on Hugging Face — Model card, benchmarks, architecture. See also: Artificial Analysis: Kimi K2.5, eesel.ai Pricing Guide, VentureBeat: K2.5 and Agent Swarms.
<a id="ref-10"></a>[10] GLM-5 on Hugging Face — 744B MoE open weights, MIT license. See also: Bloomberg: China's Zhipu Unveils New AI Model, SCMP: GLM-5 Launch, Artificial Analysis: GLM-5, Simon Willison: GLM-5 From Vibe Coding to Agentic Engineering.
<a id="ref-11"></a>[11] Hardware Corner: DeepSeek V3 on Mac Studio M3 Ultra — Comprehensive benchmarks. See also: Lattice: M3 Ultra Performance Benchmarks, Creative Strategies: Mac Studio M3 Ultra AI Review, MacStories: Testing DeepSeek R1 on M3 Ultra.
<a id="ref-12"></a>[12] Macworld: 2026 Mac Studio M5 Ultra Predictions — M4 Ultra confirmed skipped, M5 Ultra expected H2 2026 with ~1,100 GB/s bandwidth.
Additional Sources:
- Cerebras Pricing — Model speeds and per-token costs.
- Together AI Pricing — Full model catalog.
- Fireworks AI Pricing — Serverless and on-demand pricing.
- Groq Pricing — LPU-based inference, batch discounts.
- DeepSeek API Pricing — Cache hit/miss pricing tiers.
- Z.ai Official Pricing — Full GLM model catalog.
- OpenRouter: Kimi K2.5 — Provider pricing comparison.
- OpenRouter: GLM-5 — Western access pricing.
- Apple M3 Ultra Specs — Official hardware specifications.
- PricePerToken.com — LLM API pricing comparison (300+ models).
- OpenRouter State of AI — Market share and usage trends.