18 min read | tech

Running OpenClaw: A Cost Engineering Analysis of LLM Inference Providers (2026)

A quantitative analysis of running OpenClaw across 10 inference providers — including Kimi K2.5, GLM-5, and M3 Ultra Mac Studio hardware benchmarks. Hard math on cost-per-token, throughput benchmarks, smart routing savings, and projected monthly expenses with inline charts.


Disclosure: This is an automated research report generated by Claude (Anthropic) on February 12, 2026. It was commissioned by Optimal as part of internal infrastructure research for deploying autonomous AI agents. Nothing in this report constitutes financial advice. All pricing data sourced from provider documentation and third-party benchmarks as of the publication date.


Executive Summary

OpenClaw is not a model — it is an open-source AI agent orchestration platform (180K+ GitHub stars, MIT license) that connects any LLM to messaging channels (WhatsApp, Telegram, Discord, Slack, iMessage) with autonomous tool execution, persistent memory, and scheduling [1].

The critical cost decision is not OpenClaw itself (free), but which LLM backend to power it. This report benchmarks 10 inference providers and 3 new frontier models — including Kimi K2.5 (Moonshot AI) and GLM-5 (Zhipu AI) — across price, speed, and reliability. We also evaluate the M3 Ultra Mac Studio as a local inference alternative.

Key finding: A well-configured OpenClaw deployment costs $5–30/month for regular use. The new Chinese open-weight models (Kimi K2.5 at $0.60/M input, GLM-5 at $1.00/M input) deliver frontier-class performance at 5–8x lower cost than Claude Opus or GPT-5. Smart routing via ClawRouter reduces costs by 70–78% [2].

Deployment Profile | Monthly Cost | Model Strategy
Hobby (10–50 msgs/day) | $0–10 | Ollama local or free-tier APIs
Regular (50–200 msgs/day) | $15–30 | DeepSeek V3 + Groq fallback
Power (200–500 msgs/day) | $40–100 | ClawRouter multi-model + Kimi K2.5
Enterprise (500+ msgs/day) | $100–800+ | Claude/GPT-5 + smart routing
Local hardware (M3 Ultra) | $207/mo amortized | Privacy-first, offline, 671B models

Part 1: The Provider Landscape (2026 Update)

Ten providers were evaluated across speed, cost, and reliability. The benchmark uses GPT-OSS-120B (open-weights, available cross-provider) for apples-to-apples comparison [3].

Head-to-Head Benchmark: Same Model, Different Providers

Provider | Speed (tok/s) | TTFT | Input $/1M | Output $/1M | Reliability
Cerebras | 2,988 | 0.26s | $0.35 | $0.75 | 95%+
Together AI | 917 | 0.78s | $0.15 | $0.60 | 95%+
Fireworks AI | 747 | 0.17s | $0.15 | $0.60 | 95%+
Groq | 456 | 0.19s | $0.15 | $0.60 | 95%+
Baseten | 341 | 0.73s | N/A | N/A | 95%+
Clarifai | 313 | 0.27s | $0.09 | $0.09 | 95%+
DeepInfra | 79–258 | 0.23–1.27s | $0.08 | $0.30 | 68–70%
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 300" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="300" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="30" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Provider Speed: Tokens/Second (GPT-OSS-120B)</text> <text x="105" y="64" text-anchor="end" fill="#e2e8f0" font-size="11">Cerebras</text> <rect x="115" y="50" width="420" height="22" rx="4" fill="#10b981" opacity="0.85"/> <text x="545" y="65" fill="#6ee7b7" font-size="11" font-weight="600">2,988 t/s</text> <text x="105" y="98" text-anchor="end" fill="#e2e8f0" font-size="11">Together AI</text> <rect x="115" y="84" width="129" height="22" rx="4" fill="#3b82f6" opacity="0.85"/> <text x="254" y="99" fill="#93c5fd" font-size="11" font-weight="600">917</text> <text x="105" y="132" text-anchor="end" fill="#e2e8f0" font-size="11">Fireworks</text> <rect x="115" y="118" width="105" height="22" rx="4" fill="#f59e0b" opacity="0.85"/> <text x="230" y="133" fill="#fcd34d" font-size="11" font-weight="600">747</text> <text x="105" y="166" text-anchor="end" fill="#e2e8f0" font-size="11">Groq</text> <rect x="115" y="152" width="64" height="22" rx="4" fill="#8b5cf6" opacity="0.85"/> <text x="189" y="167" fill="#c4b5fd" font-size="11" font-weight="600">456</text> <text x="105" y="200" text-anchor="end" fill="#e2e8f0" font-size="11">Baseten</text> <rect x="115" y="186" width="48" height="22" rx="4" fill="#ec4899" opacity="0.85"/> <text x="173" y="201" fill="#f9a8d4" font-size="11" font-weight="600">341</text> <text x="105" y="234" text-anchor="end" fill="#e2e8f0" font-size="11">Clarifai</text> <rect x="115" y="220" width="44" height="22" rx="4" fill="#06b6d4" opacity="0.85"/> <text x="169" y="235" fill="#67e8f9" font-size="11" font-weight="600">313</text> <text x="105" y="268" text-anchor="end" fill="#e2e8f0" 
font-size="11">DeepInfra</text> <rect x="115" y="254" width="36" height="22" rx="4" fill="#ef4444" opacity="0.85"/> <text x="161" y="269" fill="#fca5a5" font-size="11" font-weight="600">79–258</text> <text x="310" y="294" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Source: lonelypx.com, Artificial Analysis — Feb 2026</text> </svg> </div>

Takeaway: Cerebras is 3x faster than the next competitor. Fireworks has the lowest latency (0.17s TTFT). DeepInfra is cheapest but unreliable — avoid for production [3].

Frontier Model Pricing (Per 1M Tokens) — Updated Feb 2026

Model | Input $/1M | Output $/1M | Context | Open Source?
DeepSeek V3.2 | $0.25 | $0.38 | 163K | Yes
Kimi K2.5 | $0.60 | $3.00 | 256K | Yes (MIT)
GLM-5 | $1.00 | $3.20 | 200K | Yes (MIT)
Gemini 3 Flash | $0.50 | $3.00 | 1M | No
Claude Sonnet 4.5 | $3.00 | $15.00 | 200K | No
GPT-5.3 Codex | $3.00 | $12.00 | 256K | No
Claude Opus 4.6 | $5.00 | $25.00 | 200K | No
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 340" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="340" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="30" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Output Cost per 1M Tokens — Frontier Models</text> <text x="120" y="64" text-anchor="end" fill="#e2e8f0" font-size="11">DeepSeek V3.2</text> <rect x="130" y="50" width="6" height="22" rx="4" fill="#10b981" opacity="0.85"/> <text x="146" y="65" fill="#6ee7b7" font-size="11" font-weight="600">$0.38</text> <text x="120" y="98" text-anchor="end" fill="#e2e8f0" font-size="11">Kimi K2.5</text> <rect x="130" y="84" width="48" height="22" rx="4" fill="#06b6d4" opacity="0.85"/> <text x="188" y="99" fill="#67e8f9" font-size="11" font-weight="600">$3.00</text> <text x="120" y="132" text-anchor="end" fill="#e2e8f0" font-size="11">Gemini 3 Flash</text> <rect x="130" y="118" width="48" height="22" rx="4" fill="#f59e0b" opacity="0.85"/> <text x="188" y="133" fill="#fcd34d" font-size="11" font-weight="600">$3.00</text> <text x="120" y="166" text-anchor="end" fill="#e2e8f0" font-size="11">GLM-5</text> <rect x="130" y="152" width="51" height="22" rx="4" fill="#8b5cf6" opacity="0.85"/> <text x="191" y="167" fill="#c4b5fd" font-size="11" font-weight="600">$3.20</text> <text x="120" y="200" text-anchor="end" fill="#e2e8f0" font-size="11">GPT-5.3 Codex</text> <rect x="130" y="186" width="192" height="22" rx="4" fill="#f97316" opacity="0.85"/> <text x="332" y="201" fill="#fdba74" font-size="11" font-weight="600">$12.00</text> <text x="120" y="234" text-anchor="end" fill="#e2e8f0" font-size="11">Claude Sonnet 4.5</text> <rect x="130" y="220" width="240" height="22" rx="4" fill="#ec4899" opacity="0.85"/> <text x="380" y="235" fill="#f9a8d4" font-size="11" font-weight="600">$15.00</text> <text x="120" y="268" text-anchor="end" 
fill="#e2e8f0" font-size="11">Claude Opus 4.6</text> <rect x="130" y="254" width="400" height="22" rx="4" fill="#ef4444" opacity="0.85"/> <text x="540" y="269" fill="#fca5a5" font-size="11" font-weight="600">$25.00</text> <line x1="130" y1="290" x2="530" y2="290" stroke="#2a2a4a" stroke-width="0.5"/> <text x="130" y="306" fill="#64748b" font-size="9">$0</text> <text x="210" y="306" fill="#64748b" font-size="9">$5</text> <text x="290" y="306" fill="#64748b" font-size="9">$10</text> <text x="370" y="306" fill="#64748b" font-size="9">$15</text> <text x="450" y="306" fill="#64748b" font-size="9">$20</text> <text x="530" y="306" fill="#64748b" font-size="9">$25</text> <text x="310" y="332" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Kimi K2.5 and GLM-5 deliver frontier benchmarks at mid-tier pricing</text> </svg> </div>

Part 2: New Contenders — Kimi K2.5 & GLM-5

Kimi K2.5 (Moonshot AI) — Released January 27, 2026

A 1 trillion parameter MoE model (32B active per token, 384 experts, MIT license) with native vision and a 256K context window. Available on Hugging Face and all major providers [9].

Benchmark | Kimi K2.5 | Claude Opus 4.5 | GPT-5.2 | Llama 3.3 70B
MMLU-Pro | 87.1 | ~87.5 | 87.1 | ~80
SWE-Bench Verified | 76.8 | 77.2–82.0 | N/A | ~45–50
GPQA-Diamond | 87.6 | ~86 | N/A | ~60
BrowseComp | 60.6–78.4 | 24.1 | 54.9 | N/A
HLE-Full (w/ tools) | 50.2 | ~40 | ~45 | N/A

Pricing across providers:

Provider | Input $/1M | Output $/1M | Speed (tok/s)
Moonshot (official) | $0.60 | $3.00 | 37
OpenRouter (DeepInfra) | $0.45 | $2.25 | N/A
Fireworks | ~$1.07 blended | ~$1.07 blended | 219
Together AI | ~$1.07 blended | ~$1.07 blended | 56
Baseten | N/A | N/A | 341

Community verdict: "Right up there with Sonnet 4.5" for CRUD web apps. Wins massively on agentic search (BrowseComp). Caveat: K2.5 is ~3x more verbose than Opus — the effective cost savings are closer to 3x, not 9x [9].

GLM-5 (Zhipu AI / Z.ai) — Released February 11, 2026

A 744B MoE model (40–44B active, 256 experts, MIT license) trained entirely on Huawei Ascend chips — zero NVIDIA dependency. Released alongside SLIME, an open-source async RL training framework [10].

Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 | Kimi K2.5
SWE-Bench Verified | 77.8 | 80.9 | 80.0 | 76.8
BrowseComp | 75.9 | 67.8 | 65.8 | 60.6
HLE-Full (w/ tools) | 50.4 | 43.4 | 45.5 | 50.2
Terminal-Bench 2.0 | 56.2 | 59.3 | 54.0 | 50.8
AIME 2026 I | 92.7 | 96.1 | N/A | N/A
Hallucination (AA) | Record low | N/A | N/A | N/A

Z.ai pricing tier (complete):

Model | Input $/1M | Output $/1M | Notes
GLM-5 (flagship) | $1.00 | $3.20 | New SOTA open model
GLM-4.7 | $0.60 | $2.20 | Previous flagship
GLM-4.7-FlashX | $0.07 | $0.40 | Budget powerhouse
GLM-4.7-Flash | Free | Free | Rate-limited, 200K ctx
GLM-4.5-Flash | Free | Free | Rate-limited

Western access: Available day-1 on OpenRouter ($0.80/$2.56 via AtlasCloud), DeepInfra, and Vercel AI Gateway. No VPN needed.

Compliance note: Z.ai remains on the U.S. Commerce Department Entity List (since Jan 2025). Use GLM models via western providers (OpenRouter, Fireworks) to mitigate regulatory risk [6].

The Chinese Open-Weight Value Play

<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 260" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="260" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Benchmark Score vs Output Price (SWE-Bench Verified)</text> <text x="310" y="46" text-anchor="middle" fill="#64748b" font-size="10">Higher = better code. Leftward = cheaper. Best position = top-left.</text> <line x1="70" y1="60" x2="70" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <line x1="70" y1="210" x2="590" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <text x="66" y="74" text-anchor="end" fill="#64748b" font-size="9">82%</text> <text x="66" y="114" text-anchor="end" fill="#64748b" font-size="9">78%</text> <text x="66" y="154" text-anchor="end" fill="#64748b" font-size="9">74%</text> <text x="66" y="194" text-anchor="end" fill="#64748b" font-size="9">70%</text> <line x1="70" y1="70" x2="590" y2="70" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="110" x2="590" y2="110" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="150" x2="590" y2="150" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="190" x2="590" y2="190" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <text x="90" y="226" fill="#64748b" font-size="9">$0.38</text> <text x="170" y="226" fill="#64748b" font-size="9">$3.00</text> <text x="280" y="226" fill="#64748b" font-size="9">$12</text> <text x="380" y="226" fill="#64748b" font-size="9">$15</text> <text x="530" y="226" fill="#64748b" font-size="9">$25</text> <text x="330" y="246" text-anchor="middle" fill="#64748b" font-size="9">Output cost per 1M tokens →</text> <circle cx="90" cy="194" r="8" fill="#10b981" opacity="0.8"/> <text x="90" y="188" text-anchor="middle" fill="#6ee7b7" 
font-size="8">DSv3.2</text> <text x="90" y="204" text-anchor="middle" fill="#6ee7b7" font-size="7">70.2%</text> <circle cx="168" cy="102" r="10" fill="#06b6d4" opacity="0.8"/> <text x="168" y="96" text-anchor="middle" fill="#67e8f9" font-size="8">K2.5</text> <text x="168" y="112" text-anchor="middle" fill="#67e8f9" font-size="7">76.8%</text> <circle cx="182" cy="90" r="10" fill="#8b5cf6" opacity="0.8"/> <text x="182" y="84" text-anchor="middle" fill="#c4b5fd" font-size="8">GLM-5</text> <text x="182" y="100" text-anchor="middle" fill="#c4b5fd" font-size="7">77.8%</text> <circle cx="290" cy="78" r="9" fill="#f97316" opacity="0.8"/> <text x="290" y="72" text-anchor="middle" fill="#fdba74" font-size="8">GPT-5.2</text> <text x="290" y="88" text-anchor="middle" fill="#fdba74" font-size="7">80.0%</text> <circle cx="390" cy="82" r="9" fill="#ec4899" opacity="0.8"/> <text x="390" y="76" text-anchor="middle" fill="#f9a8d4" font-size="8">Sonnet</text> <text x="390" y="92" text-anchor="middle" fill="#f9a8d4" font-size="7">77.2%</text> <circle cx="540" cy="70" r="9" fill="#ef4444" opacity="0.8"/> <text x="540" y="64" text-anchor="middle" fill="#fca5a5" font-size="8">Opus</text> <text x="540" y="80" text-anchor="middle" fill="#fca5a5" font-size="7">80.9%</text> </svg> </div>

The bottom line: Kimi K2.5 and GLM-5 are within 3–4 points of Claude Opus on SWE-Bench Verified — at $3.00–3.20/M output vs $25.00/M output. That's a 7–8x cost reduction for ~96% of the coding capability. Both are fully MIT-licensed with open weights.


Part 3: The Math — Cost Modeling

Token Economics Primer

A typical OpenClaw conversation turn consumes:

  • System prompt + memory context: ~2,000 tokens (input)
  • User message: ~100 tokens (input)
  • Tool calls + results: ~500 tokens (input/output)
  • Agent response: ~300 tokens (output)

Per-turn total: ~2,600 input + ~800 output tokens

Cost Per Turn by Provider (Including New Models)

Provider | Model | Cost/Turn | 100 turns/day | 30-day cost
Z.ai | GLM-4.7-Flash | $0.000000 | $0.000 | $0.00
Together AI | Gemma 3n E4B | $0.000084 | $0.008 | $0.25
Groq | Llama 3.1 8B | $0.000194 | $0.019 | $0.58
Cerebras | Llama 3.1 8B | $0.000340 | $0.034 | $1.02
Groq | GPT-OSS-120B | $0.000870 | $0.087 | $2.61
OpenRouter | DeepSeek V3.2 | $0.000954 | $0.095 | $2.87
Groq | Llama 3.3 70B | $0.002166 | $0.217 | $6.50
OpenRouter | Kimi K2.5 | $0.002970 | $0.297 | $8.91
Z.ai | GLM-5 | $0.005160 | $0.516 | $15.48
Direct | Claude Sonnet 4.5 | $0.019800 | $1.980 | $59.40
Direct | Claude Opus 4.6 | $0.033000 | $3.300 | $99.00

Formula: Cost/turn = (input_tokens × input_price/1M) + (output_tokens × output_price/1M)

Kimi K2.5 (OpenRouter): (2,600 × $0.45/1M) + (800 × $2.25/1M) = $0.001170 + $0.001800 = $0.002970

GLM-5 (Z.ai): (2,600 × $1.00/1M) + (800 × $3.20/1M) = $0.002600 + $0.002560 = $0.005160
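These per-turn figures, and the monthly projections that follow, reduce to a few lines of arithmetic. A minimal sketch using the report's token estimates (2,600 input / 800 output per turn); the function names are ours, not part of any provider SDK:

```python
# Cost per conversation turn and projected monthly cost.
# Token counts (2,600 in / 800 out) follow the report's per-turn estimate.

def cost_per_turn(input_price, output_price, input_tokens=2_600, output_tokens=800):
    """Prices are USD per 1M tokens; returns USD per turn."""
    return input_tokens * input_price / 1e6 + output_tokens * output_price / 1e6

def monthly_cost(turn_cost, turns_per_day, days=30):
    return turn_cost * turns_per_day * days

kimi = cost_per_turn(0.45, 2.25)   # Kimi K2.5 via OpenRouter pricing
glm5 = cost_per_turn(1.00, 3.20)   # GLM-5 via Z.ai pricing

print(f"Kimi K2.5: ${kimi:.6f}/turn, ${monthly_cost(kimi, 100):.2f}/mo at 100 msgs/day")
print(f"GLM-5:     ${glm5:.6f}/turn, ${monthly_cost(glm5, 100):.2f}/mo at 100 msgs/day")
```

Plugging in other rows of the pricing tables reproduces the projection table below it.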

Monthly Cost Projection — Visual Scaling

<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 340" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="340" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Monthly Cost by Daily Message Volume</text> <text x="50" y="62" text-anchor="end" fill="#94a3b8" font-size="10">$600</text> <text x="50" y="102" text-anchor="end" fill="#94a3b8" font-size="10">$450</text> <text x="50" y="142" text-anchor="end" fill="#94a3b8" font-size="10">$300</text> <text x="50" y="182" text-anchor="end" fill="#94a3b8" font-size="10">$150</text> <text x="50" y="222" text-anchor="end" fill="#94a3b8" font-size="10">$50</text> <text x="50" y="262" text-anchor="end" fill="#94a3b8" font-size="10">$0</text> <line x1="60" y1="58" x2="580" y2="58" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="98" x2="580" y2="98" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="138" x2="580" y2="138" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="178" x2="580" y2="178" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="218" x2="580" y2="218" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="258" x2="580" y2="258" stroke="#94a3b8" stroke-width="0.5"/> <text x="100" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">25/day</text> <text x="204" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">100/day</text> <text x="308" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">250/day</text> <text x="412" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">500/day</text> <text x="516" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">1000/day</text> <polyline points="100,257 204,257 308,256 412,255 516,254" fill="none" stroke="#10b981" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,255 204,249 308,236 412,214 516,170" fill="none" 
stroke="#06b6d4" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,253 204,241 308,215 412,172 516,86" fill="none" stroke="#8b5cf6" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,243 204,218 308,178 412,98 516,58" fill="none" stroke="#ec4899" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,225 204,178 308,98 412,58 516,58" fill="none" stroke="#ef4444" stroke-width="2.5" stroke-linecap="round" stroke-dasharray="6,3"/> <circle cx="100" cy="257" r="3" fill="#10b981"/><circle cx="204" cy="257" r="3" fill="#10b981"/><circle cx="308" cy="256" r="3" fill="#10b981"/><circle cx="412" cy="255" r="3" fill="#10b981"/><circle cx="516" cy="254" r="3" fill="#10b981"/> <circle cx="100" cy="255" r="3" fill="#06b6d4"/><circle cx="204" cy="249" r="3" fill="#06b6d4"/><circle cx="308" cy="236" r="3" fill="#06b6d4"/><circle cx="412" cy="214" r="3" fill="#06b6d4"/><circle cx="516" cy="170" r="3" fill="#06b6d4"/> <circle cx="100" cy="253" r="3" fill="#8b5cf6"/><circle cx="204" cy="241" r="3" fill="#8b5cf6"/><circle cx="308" cy="215" r="3" fill="#8b5cf6"/><circle cx="412" cy="172" r="3" fill="#8b5cf6"/><circle cx="516" cy="86" r="3" fill="#8b5cf6"/> <circle cx="100" cy="243" r="3" fill="#ec4899"/><circle cx="204" cy="218" r="3" fill="#ec4899"/><circle cx="308" cy="178" r="3" fill="#ec4899"/><circle cx="412" cy="98" r="3" fill="#ec4899"/> <rect x="90" y="290" width="12" height="3" rx="1" fill="#10b981"/><text x="108" y="294" fill="#94a3b8" font-size="9">Groq 8B ($0.58)</text> <rect x="195" y="290" width="12" height="3" rx="1" fill="#06b6d4"/><text x="213" y="294" fill="#94a3b8" font-size="9">Kimi K2.5 ($8.91)</text> <rect x="310" y="290" width="12" height="3" rx="1" fill="#8b5cf6"/><text x="328" y="294" fill="#94a3b8" font-size="9">GLM-5 ($15.48)</text> <rect x="405" y="290" width="12" height="3" rx="1" fill="#ec4899"/><text x="423" y="294" fill="#94a3b8" font-size="9">Sonnet ($59)</text> <rect x="495" y="290" width="12" height="3" rx="1" 
fill="#ef4444"/><text x="513" y="294" fill="#94a3b8" font-size="9">Opus ($99)</text> <text x="310" y="332" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Monthly cost at 100 msgs/day reference point. Opus/Sonnet scale exceeds chart at high volumes.</text> </svg> </div>

Daily Messages | Groq 8B | Kimi K2.5 | GLM-5 | GLM-4.7-Flash | Sonnet 4.5 | Opus 4.6
25 | $0.15 | $2.23 | $3.87 | $0.00 | $14.85 | $24.75
100 | $0.58 | $8.91 | $15.48 | $0.00 | $59.40 | $99.00
250 | $1.46 | $22.28 | $38.70 | $0.00 | $148.50 | $247.50
500 | $2.91 | $44.55 | $77.40 | $0.00 | $297.00 | $495.00
1,000 | $5.82 | $89.10 | $154.80 | $0.00 | $594.00 | $990.00

The standout: GLM-4.7-Flash is genuinely free (rate-limited) with 200K context. For agent prototyping and high-volume tool-calling, this is unbeatable at $0.


Part 4: Smart Routing — The 78% Cost Reduction

The Problem

A naive OpenClaw config sends every request — including "what time is it?" and "check my calendar" — to the same model. If that model is Claude Opus at $25/1M output tokens, you're paying frontier prices for trivial tasks.

ClawRouter Architecture

ClawRouter analyzes each prompt locally (<1ms, zero API calls) using a 14-dimension scoring system and routes to the cheapest capable model [2].
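As a toy illustration of the local pre-routing idea, a classifier can bucket prompts into tiers before any API call. Everything here is invented for illustration: the keyword heuristics and tier-to-model mapping are NOT ClawRouter's real 14-dimension scoring system.

```python
# Toy local prompt-tier classifier, in the spirit of routing before any API call.
# Keyword lists and the tier->model mapping are illustrative inventions.
TIER_MODELS = {
    "SIMPLE":   "glm-4.7-flash",   # free tier
    "MODERATE": "kimi-k2.5",
    "COMPLEX":  "glm-5",
    "EXPERT":   "claude-opus-4.6",
}

def classify(prompt: str) -> str:
    p = prompt.lower()
    if any(k in p for k in ("architecture", "design a system", "deep research")):
        return "EXPERT"
    if any(k in p for k in ("debug", "refactor", "implement", "write a function")):
        return "COMPLEX"
    if any(k in p for k in ("summarize", "translate", "rewrite")):
        return "MODERATE"
    return "SIMPLE"                # status checks, time, simple math, etc.

def route(prompt: str) -> str:
    return TIER_MODELS[classify(prompt)]

print(route("what time is it?"))        # falls through to the SIMPLE tier
print(route("refactor this function"))  # matched as COMPLEX
```

A real router would score many dimensions (length, tool usage, code presence) instead of keywords, but the control flow is the same: classify locally, then dispatch to the cheapest capable model.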

Tier | Example Requests | Route To | Avg Cost/Turn | Weighted Cost
SIMPLE (45%) | Status, math, time | GLM-4.7-Flash | $0.0000 | $0.000000
MODERATE (30%) | Summarize, translate | Kimi K2.5 | $0.0030 | $0.000900
COMPLEX (20%) | Coding, reasoning | GLM-5 or Sonnet | $0.0052 | $0.001040
EXPERT (5%) | Architecture, research | Claude Opus 4.6 | $0.0330 | $0.001650
Blended (100%) | | | | $0.003590

Without routing (all Claude Opus): $0.033/turn
With routing (blended, using new models): $0.0036/turn
Savings: 89.1% (up from 81.7% before Kimi K2.5 and GLM free tier)
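The blended figure is just a probability-weighted sum of per-tier turn costs, which can be checked directly:

```python
# Weighted blended cost per turn across routing tiers (report's figures).
tiers = {              # tier: (share of requests, average USD per turn)
    "SIMPLE":   (0.45, 0.0000),
    "MODERATE": (0.30, 0.0030),
    "COMPLEX":  (0.20, 0.0052),
    "EXPERT":   (0.05, 0.0330),
}

blended = sum(share * cost for share, cost in tiers.values())
savings = 1 - blended / 0.033  # vs. sending everything to Claude Opus

print(f"blended: ${blended:.6f}/turn, savings: {savings:.1%}")
```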

One developer reported their Anthropic bill dropped from $4,660/month to ~$1,400/month using ClawRouter — a 70% reduction — because ~60% of their agent's requests were simple enough for budget models [2].

OpenRouter Alternative

OpenRouter offers similar capability through its Auto Model feature [4]:

Feature | OpenRouter | ClawRouter
Routing location | Server-side | Local (<1ms)
BYOK support | Yes (1M free req/month) | N/A
Fee structure | 0% markup + 5% BYOK after 1M | Free (open-source)
Model access | 400+ models (incl. Kimi K2.5, GLM-5) | Configure your own
Failover | Automatic (50+ providers) | Manual config

Part 5: Local Hardware — M3 Ultra Mac Studio

Why Consider Local Inference?

Cloud APIs win on per-token cost for most workloads. But local hardware makes sense when: (1) privacy/data sovereignty is non-negotiable, (2) you need offline access, (3) you want to run massive 400B+ models, or (4) you're deploying fine-tuned models not available via API.

M3 Ultra Mac Studio Specifications

Spec | M3 Ultra (32C/80G)
CPU | 32-core (24P + 8E)
GPU | 80-core
Max Unified Memory | 512GB LPDDR5x
Memory Bandwidth | 819 GB/s
TDP | ~100W under load
Price (256GB) | $6,999
Price (512GB) | $9,499

Key advantage: 512GB of unified memory in a single box — enough to run DeepSeek V3 671B quantized. An equivalent GPU setup (8x H100) would cost $200K+.

Tokens/Second Benchmarks (MLX Framework)

<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="380" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">M3 Ultra 512GB: Generation Speed by Model (MLX)</text> <text x="130" y="62" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 1B (Q4)</text> <rect x="140" y="49" width="395" height="20" rx="3" fill="#10b981" opacity="0.85"/> <text x="545" y="63" fill="#6ee7b7" font-size="10" font-weight="600">237 t/s</text> <text x="130" y="90" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 4B (Q4)</text> <rect x="140" y="77" width="223" height="20" rx="3" fill="#10b981" opacity="0.75"/> <text x="373" y="91" fill="#6ee7b7" font-size="10" font-weight="600">134 t/s</text> <text x="130" y="118" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.1 8B (4-bit)</text> <rect x="140" y="105" width="217" height="20" rx="3" fill="#3b82f6" opacity="0.85"/> <text x="367" y="119" fill="#93c5fd" font-size="10" font-weight="600">130 t/s</text> <text x="130" y="146" text-anchor="end" fill="#e2e8f0" font-size="10">QwQ 32B (4-bit)</text> <rect x="140" y="133" width="58" height="20" rx="3" fill="#f59e0b" opacity="0.85"/> <text x="208" y="147" fill="#fcd34d" font-size="10" font-weight="600">35 t/s</text> <text x="130" y="174" text-anchor="end" fill="#e2e8f0" font-size="10">Qwen3 235B (FP8)</text> <rect x="140" y="161" width="50" height="20" rx="3" fill="#8b5cf6" opacity="0.85"/> <text x="200" y="175" fill="#c4b5fd" font-size="10" font-weight="600">30 t/s</text> <text x="130" y="202" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek V3 671B</text> <rect x="140" y="189" width="35" height="20" rx="3" fill="#ec4899" opacity="0.85"/> <text x="185" y="203" fill="#f9a8d4" font-size="10" font-weight="600">21 t/s</text> <text x="130" 
y="230" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek R1 671B</text> <rect x="140" y="217" width="33" height="20" rx="3" fill="#ec4899" opacity="0.75"/> <text x="183" y="231" fill="#f9a8d4" font-size="10" font-weight="600">20 t/s</text> <text x="130" y="258" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.3 70B (Q4)</text> <rect x="140" y="245" width="28" height="20" rx="3" fill="#3b82f6" opacity="0.75"/> <text x="178" y="259" fill="#93c5fd" font-size="10" font-weight="600">17 t/s</text> <text x="130" y="286" text-anchor="end" fill="#e2e8f0" font-size="10">GLM-4.7 358B (Q3)</text> <rect x="140" y="273" width="25" height="20" rx="3" fill="#8b5cf6" opacity="0.75"/> <text x="175" y="287" fill="#c4b5fd" font-size="10" font-weight="600">15 t/s</text> <text x="310" y="318" text-anchor="middle" fill="#94a3b8" font-size="10">MoE models (DeepSeek, Qwen3 235B) outperform their size class</text> <text x="310" y="334" text-anchor="middle" fill="#94a3b8" font-size="10">because only active experts (~37B) need memory bandwidth per token</text> <text x="310" y="360" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Sources: Hardware Corner, Lattice, Creative Strategies, MacStories — Feb 2026</text> </svg> </div>

Model | Size | Quant | tok/s (gen) | RAM Needed
Gemma 3 1B | 1B | Q4 | 237 | <4GB
Llama 3.1 8B | 8B | 4-bit | 130 | ~5GB
QwQ 32B | 32B | 4-bit | 35 | ~20GB
Qwen3 235B (MoE) | 235B | FP8 | 30 | ~256GB
DeepSeek V3 (MoE) | 671B | 4-bit | 21 | ~405GB
DeepSeek R1 (MoE) | 671B | 4-bit | 20 | ~405GB
Llama 3.3 70B | 70B | Q4_K_M | 17 | ~40GB
GLM-4.7 358B (MoE) | 358B | Q3 | ~15 | ~256GB
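A rough rule of thumb behind the RAM column: weight memory is roughly parameters times effective bits-per-weight divided by 8. The sketch below assumes ~4.85 effective bits for a Q4_K_M-style mix (our assumption; actual quant mixes vary), and it counts weights only — KV cache and runtime overhead come on top:

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model.

    params_billion: total parameter count in billions
    bits_per_weight: effective bits per weight (~4.85 assumed for Q4_K_M mixes)
    Counts weights only; KV cache and runtime overhead are extra.
    """
    # params * bits / 8 bytes; the 1e9 factors cancel, leaving GB directly
    return params_billion * bits_per_weight / 8

# DeepSeek V3 at 671B lands near the table's ~405GB figure
print(f"DeepSeek V3 671B: ~{quantized_weights_gb(671, 4.85):.0f} GB")
print(f"Llama 3.3 70B:    ~{quantized_weights_gb(70, 4.85):.0f} GB")
```

This is why the 512GB M3 Ultra is the interesting configuration: a 671B model at 4-bit fits with headroom for context, while the 256GB box does not.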

Critical context: DeepSeek V3's speed drops from 21 tok/s to 5.8 tok/s as context grows from 69 to 16K tokens. The KV cache competes for memory bandwidth. Plan for 10–15 tok/s at realistic conversation lengths.

Framework Choice Matters

Framework | Speed vs MLX | Best For
MLX (Apple native) | Baseline (fastest) | Maximum Apple Silicon performance
LM Studio (MLX backend) | ~Same | GUI + ease of use
llama.cpp | 20–50% slower | Cross-platform, broader model support
Ollama | 20–40% slower | Easiest setup, REST API

Use MLX for best performance. For DeepSeek V3 671B the gap was even larger than the typical 20–50%: MLX achieved 21 tok/s vs llama.cpp's 6.2 tok/s, a 3.4x difference [11].

Cloud API vs. Local Hardware Break-Even

Usage Pattern | M3 Ultra 512GB ($/M tokens) | Cheapest Cloud (DeepInfra) | Winner
Light (2M tok/day) | $3.45/M | $0.30/M | Cloud wins 11x
Medium (10M tok/day) | $0.69/M | $0.30/M | Cloud wins 2.3x
Heavy 24/7 | $5.30/M | $0.30/M | Cloud wins
Privacy/offline | Priceless | N/A | Local wins
Run 671B models | $0.69–5.30/M | $3.00–25.00/M (API) | Local wins
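The local $/M figures follow from dividing amortized hardware cost by monthly token throughput. A sketch using the report's $207/month amortization (electricity excluded; that simplification is ours):

```python
def local_cost_per_million(monthly_usd: float, tokens_per_day: float, days: int = 30) -> float:
    """Amortized hardware cost in USD per 1M generated tokens."""
    monthly_tokens = tokens_per_day * days
    return monthly_usd / (monthly_tokens / 1e6)

print(f"Light  (2M tok/day):  ${local_cost_per_million(207, 2e6):.2f}/M")
print(f"Medium (10M tok/day): ${local_cost_per_million(207, 10e6):.2f}/M")
```

The break-even point against a $0.30/M cloud provider would require ~23M tokens/day of sustained generation, which is beyond what the hardware's realistic 10–15 tok/s throughput can deliver.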

The uncomfortable truth: At current cloud API pricing, local inference rarely breaks even on pure cost. Cloud providers benefit from massive batch sizes, FP8 hardware optimization, and economies of scale. The M3 Ultra's value proposition is the 512GB unified memory pool — running models that would require $200K in GPU hardware otherwise.

M5 Ultra (expected late 2026): ~1,100 GB/s bandwidth (+34%), ~768GB max RAM, 3–4x faster prompt processing. If the purchase is not urgent, it may be worth waiting [12].


Part 6: Security Cost — The Hidden Variable

OpenClaw's permissionless architecture introduces non-trivial security costs [8]:

Incident | Date | Impact
40,000+ exposed instances | Jan 2026 | 93.4% had auth bypass flaws
Moltbook data leak | Jan 2026 | 1.5M API tokens, 35K emails exposed
Scam token via hijacked agent | Jan 2026 | $16M in losses
Prompt injection backdoors | Ongoing | Agents execute attacker instructions

Mitigation cost considerations:

Security Measure | Implementation | Cost Impact
Authentication (required) | Gateway token auth | +$0 (config change)
Docker sandboxing | Per-session containers | +10–20% memory overhead
Prompt injection defense | Latest-gen models only | +$10–50/month (use Claude over Llama)
NanoClaw | Hardened fork | $0 (open-source)

Conclusion & Updated Recommendation

Cost-Optimized OpenClaw Stack for Optimal (February 2026)

Layer | Choice | Monthly Cost
Hosting | Hetzner CX22 VPS | $4.15
Agent Platform | OpenClaw (MIT license) | $0
LLM Router | OpenRouter (BYOK, 1M free req/mo) | $0
Simple tasks (45%) | GLM-4.7-Flash (free tier) | $0
Moderate tasks (30%) | Kimi K2.5 via OpenRouter | ~$5–10
Complex tasks (20%) | GLM-5 or Claude Sonnet 4.5 | ~$5–15
Expert tasks (5%) | Claude Opus 4.6 via BYOK | ~$3–8
Security | NanoClaw fork + Docker sandboxing | $0
Total projected | | $17–37/month

Decision Matrix (Updated)

If you need... | Choose... | Why
Fastest inference | Cerebras | 2,988 tok/s, 3x faster than #2
Lowest latency | Fireworks AI | 0.17s TTFT
Best free tier | GLM-4.7-Flash | Free, 200K context, genuinely capable
Best open-weight frontier | GLM-5 or Kimi K2.5 | MIT license, 96% of Opus quality, 7–8x cheaper
Best agentic model | Kimi K2.5 (Swarm mode) | BrowseComp 78.4%, HLE-tools 50.2%
Best cost/performance | DeepSeek V3.2 | $0.38/M output, solid for moderate tasks
Maximum cost control | ClawRouter + multi-model | 89% savings via smart routing
Run 671B models locally | M3 Ultra 512GB | Only $9.5K box that fits DeepSeek V3
Zero spend | Oracle Free + Ollama | $0/month, limited capability
Regulatory safety | Avoid Z.ai direct | Use GLM models via OpenRouter/Fireworks

This report was generated on February 12, 2026 using parallel AI research agents (Claude, Anthropic). Updated with Kimi K2.5, GLM-5, and M3 Ultra Mac Studio benchmarks on the same date. All claims are hyperlinked to their sources. This is not financial advice.

Automated report produced for Optimal | Technology Category


Sources & References

<a id="ref-1"></a>[1] OpenClaw Official Documentation — Architecture, deployment guides, and hardware requirements.

<a id="ref-2"></a>[2] ClawRouter — Smart LLM Router — Open-source routing layer claiming 78% cost savings. See also: ClawRouter: How I Cut My $4,660 Bill by 70% (Medium).

<a id="ref-3"></a>[3] Open Source AI API Providers: Speed, Cost & Performance Compared (2026) — Independent benchmark of GPT-OSS-120B across 6 providers.

<a id="ref-4"></a>[4] OpenRouter Pricing & BYOK Documentation — 400+ models, 1M free BYOK requests/month. See also: OpenRouter BYOK Announcement.

<a id="ref-5"></a>[5] OpenClaw Deploy Cost Guide by WenHao Yu — Comprehensive hosting cost analysis ($0–8/month configurations).

<a id="ref-6"></a>[6] Z.ai (Zhipu AI) Wikipedia — Company background, HKEX listing, U.S. Entity List status.

<a id="ref-7"></a>[7] Contemplating Local LLMs vs OpenRouter and Z.ai — First-hand speed testing (20–30 tok/s on Z.ai direct).

<a id="ref-8"></a>[8] CrowdStrike: What Security Teams Need to Know About OpenClaw — Security analysis. See also: Infosecurity Magazine, Cisco Blog, Trend Micro.

<a id="ref-9"></a>[9] Kimi K2.5 on Hugging Face — Model card, benchmarks, architecture. See also: Artificial Analysis: Kimi K2.5, eesel.ai Pricing Guide, VentureBeat: K2.5 and Agent Swarms.

<a id="ref-10"></a>[10] GLM-5 on Hugging Face — 744B MoE open weights, MIT license. See also: Bloomberg: China's Zhipu Unveils New AI Model, SCMP: GLM-5 Launch, Artificial Analysis: GLM-5, Simon Willison: GLM-5 From Vibe Coding to Agentic Engineering.

<a id="ref-11"></a>[11] Hardware Corner: DeepSeek V3 on Mac Studio M3 Ultra — Comprehensive benchmarks. See also: Lattice: M3 Ultra Performance Benchmarks, Creative Strategies: Mac Studio M3 Ultra AI Review, MacStories: Testing DeepSeek R1 on M3 Ultra.

<a id="ref-12"></a>[12] Macworld: 2026 Mac Studio M5 Ultra Predictions — M4 Ultra confirmed skipped, M5 Ultra expected H2 2026 with ~1,100 GB/s bandwidth.
