Running OpenClaw: A Cost Engineering Analysis of LLM Inference Providers (2026)
A quantitative analysis of running OpenClaw across 10 inference providers — including Kimi K2.5, GLM-5, and M3 Ultra Mac Studio hardware benchmarks. Hard math on cost-per-token, throughput benchmarks, smart routing savings, and projected monthly expenses with inline charts.
Disclosure: This is an automated research report generated by Claude (Anthropic) on February 12, 2026 (updated February 12, 2026). It was commissioned by Optimal as part of internal infrastructure research for deploying autonomous AI agents. Nothing in this report constitutes financial advice. All pricing data sourced from provider documentation and third-party benchmarks as of the publication date.
Executive Summary
OpenClaw is not a model — it is an open-source AI agent orchestration platform (180K+ GitHub stars, MIT license) that connects any LLM to messaging channels (WhatsApp, Telegram, Discord, Slack, iMessage) with autonomous tool execution, persistent memory, and scheduling [1].
The critical cost decision is not OpenClaw itself (free), but which LLM backend to power it. This report benchmarks 10 inference providers and 3 new frontier models — including Kimi K2.5 (Moonshot AI) and GLM-5 (Zhipu AI) — across price, speed, and reliability. We also evaluate the M3 Ultra Mac Studio as a local inference alternative.
Key finding: A well-configured OpenClaw deployment costs $5–30/month for regular use. The new Chinese open-weight models (Kimi K2.5 at $0.60/M input, GLM-5 at $1.00/M input) deliver frontier-class performance at one-fifth to one-eighth the price of Claude Opus or GPT-5. Smart routing via ClawRouter reduces costs by 70–78% [2].
| Deployment Profile | Monthly Cost | Model Strategy |
|---|---|---|
| Hobby (10–50 msgs/day) | $0–10 | Ollama local or free-tier APIs |
| Regular (50–200 msgs/day) | $15–30 | DeepSeek V3 + Groq fallback |
| Power (200–500 msgs/day) | $40–100 | ClawRouter multi-model + Kimi K2.5 |
| Enterprise (500+ msgs/day) | $100–800+ | Claude/GPT-5 + smart routing |
| Local hardware (M3 Ultra) | $207/mo amortized | Privacy-first, offline, 671B models |
Part 1: The Provider Landscape (2026 Update)
Ten providers were evaluated across speed, cost, and reliability. The benchmark uses GPT-OSS-120B (open-weights, available cross-provider) for apples-to-apples comparison [3].
Head-to-Head Benchmark: Same Model, Different Providers
| Provider | Speed (tok/s) | TTFT | Input $/1M | Output $/1M | Reliability |
|---|---|---|---|---|---|
| Cerebras | 2,988 | 0.26s | $0.35 | $0.75 | 95%+ |
| Together AI | 917 | 0.78s | $0.15 | $0.60 | 95%+ |
| Fireworks AI | 747 | 0.17s | $0.15 | $0.60 | 95%+ |
| Groq | 456 | 0.19s | $0.15 | $0.60 | 95%+ |
| Baseten | 341 | 0.73s | — | — | 95%+ |
| Clarifai | 313 | 0.27s | $0.09 | $0.09 | 95%+ |
| DeepInfra | 79–258 | 0.23–1.27s | $0.08 | $0.30 | 68–70% |
Takeaway: Cerebras is 3x faster than the next competitor. Fireworks has the lowest latency (0.17s TTFT). DeepInfra is cheapest but unreliable — avoid for production [3].
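The two headline metrics in the table, TTFT and tokens/second, fall out of any streaming response. A minimal sketch of the measurement logic (this is an illustration, not the benchmark's actual harness; the word-split token count is a crude proxy for real tokenizer counts):

```python
import time

def benchmark_stream(chunks, clock=time.perf_counter):
    """Measure time-to-first-token (TTFT) and generation speed over an
    iterable of streamed text chunks (e.g. an SSE response body).
    `clock` is injectable so the logic can be tested deterministically."""
    start = clock()
    ttft = None
    tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start      # first chunk arrived
        tokens += len(chunk.split())    # crude token proxy; real runs read the API's usage counts
    total = clock() - start
    return {"ttft_s": ttft, "tok_per_s": tokens / (total - ttft)}
```

Generation speed is computed over the post-TTFT window, which is why providers with similar tok/s can still differ sharply in perceived latency.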
Frontier Model Pricing (Per 1M Tokens) — Updated Feb 2026
| Model | Input $/1M | Output $/1M | Context | Open Source? |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.25 | $0.38 | 163K | Yes |
| Kimi K2.5 | $0.60 | $3.00 | 256K | Yes (MIT) |
| GLM-5 | $1.00 | $3.20 | 200K | Yes (MIT) |
| Gemini 3 Flash | $0.50 | $3.00 | 1M | No |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200K | No |
| GPT-5.3 Codex | $3.00 | $12.00 | 256K | No |
| Claude Opus 4.6 | $5.00 | $25.00 | 200K | No |
Part 2: New Contenders — Kimi K2.5 & GLM-5
Kimi K2.5 (Moonshot AI) — Released January 27, 2026
A 1 trillion parameter MoE model (32B active per token, 384 experts, MIT license) with native vision and a 256K context window. Available on Hugging Face and all major providers [9].
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | GPT-5.2 | Llama 3.3 70B |
|---|---|---|---|---|
| MMLU-Pro | 87.1 | ~87.5 | 87.1 | ~80 |
| SWE-Bench Verified | 76.8 | 77.2–82.0 | — | ~45–50 |
| GPQA-Diamond | 87.6 | ~86 | — | ~60 |
| BrowseComp | 60.6–78.4 | 24.1 | 54.9 | N/A |
| HLE-Full (w/ tools) | 50.2 | ~40 | ~45 | N/A |
Pricing across providers:
| Provider | Input $/1M | Output $/1M | Speed (tok/s) |
|---|---|---|---|
| Moonshot (official) | $0.60 | $3.00 | 37 |
| OpenRouter (DeepInfra) | $0.45 | $2.25 | — |
| Fireworks | ~$1.07 blended | — | 219 |
| Together AI | ~$1.07 blended | — | 56 |
| Baseten | — | — | 341 |
Community verdict: "Right up there with Sonnet 4.5" for CRUD web apps. Wins massively on agentic search (BrowseComp). Caveat: K2.5 is ~3x more verbose than Opus — the effective cost savings are closer to 3x, not 9x [9].
GLM-5 (Zhipu AI / Z.ai) — Released February 11, 2026
A 744B MoE model (40–44B active, 256 experts, MIT license) trained entirely on Huawei Ascend chips — zero NVIDIA dependency. Released alongside SLIME, an open-source async RL training framework [10].
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 | Kimi K2.5 |
|---|---|---|---|---|
| SWE-Bench Verified | 77.8 | 80.9 | 80.0 | 76.8 |
| BrowseComp | 75.9 | 67.8 | 65.8 | 60.6 |
| HLE-Full (w/ tools) | 50.4 | 43.4 | 45.5 | 50.2 |
| Terminal-Bench 2.0 | 56.2 | 59.3 | 54.0 | 50.8 |
| AIME 2026 I | 92.7 | — | — | 96.1 |
| Hallucination (AA) | Record low | — | — | — |
Z.ai pricing tier (complete):
| Model | Input $/1M | Output $/1M | Notes |
|---|---|---|---|
| GLM-5 (flagship) | $1.00 | $3.20 | New SOTA open model |
| GLM-4.7 | $0.60 | $2.20 | Previous flagship |
| GLM-4.7-FlashX | $0.07 | $0.40 | Budget powerhouse |
| GLM-4.7-Flash | Free | Free | Rate-limited, 200K ctx |
| GLM-4.5-Flash | Free | Free | Rate-limited |
Western access: Available day-1 on OpenRouter ($0.80/$2.56 via AtlasCloud), DeepInfra, and Vercel AI Gateway. No VPN needed.
Compliance note: Z.ai remains on the U.S. Commerce Department Entity List (since Jan 2025). Use GLM models via western providers (OpenRouter, Fireworks) to mitigate regulatory risk [6].
The Chinese Open-Weight Value Play
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 260" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="260" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Benchmark Score vs Output Price (SWE-Bench Verified)</text> <text x="310" y="46" text-anchor="middle" fill="#64748b" font-size="10">Higher = better code. Leftward = cheaper. Best position = top-left.</text> <line x1="70" y1="60" x2="70" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <line x1="70" y1="210" x2="590" y2="210" stroke="#2a2a4a" stroke-width="0.5"/> <text x="66" y="74" text-anchor="end" fill="#64748b" font-size="9">82%</text> <text x="66" y="114" text-anchor="end" fill="#64748b" font-size="9">78%</text> <text x="66" y="154" text-anchor="end" fill="#64748b" font-size="9">74%</text> <text x="66" y="194" text-anchor="end" fill="#64748b" font-size="9">70%</text> <line x1="70" y1="70" x2="590" y2="70" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="110" x2="590" y2="110" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="150" x2="590" y2="150" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <line x1="70" y1="190" x2="590" y2="190" stroke="#2a2a4a" stroke-width="0.3" stroke-dasharray="4"/> <text x="90" y="226" fill="#64748b" font-size="9">$0.38</text> <text x="170" y="226" fill="#64748b" font-size="9">$3.00</text> <text x="280" y="226" fill="#64748b" font-size="9">$12</text> <text x="380" y="226" fill="#64748b" font-size="9">$15</text> <text x="530" y="226" fill="#64748b" font-size="9">$25</text> <text x="330" y="246" text-anchor="middle" fill="#64748b" font-size="9">Output cost per 1M tokens →</text> <circle cx="90" cy="194" r="8" fill="#10b981" opacity="0.8"/> <text x="90" y="188" text-anchor="middle" fill="#6ee7b7" 
font-size="8">DSv3.2</text> <text x="90" y="204" text-anchor="middle" fill="#6ee7b7" font-size="7">70.2%</text> <circle cx="168" cy="102" r="10" fill="#06b6d4" opacity="0.8"/> <text x="168" y="96" text-anchor="middle" fill="#67e8f9" font-size="8">K2.5</text> <text x="168" y="112" text-anchor="middle" fill="#67e8f9" font-size="7">76.8%</text> <circle cx="182" cy="90" r="10" fill="#8b5cf6" opacity="0.8"/> <text x="182" y="84" text-anchor="middle" fill="#c4b5fd" font-size="8">GLM-5</text> <text x="182" y="100" text-anchor="middle" fill="#c4b5fd" font-size="7">77.8%</text> <circle cx="290" cy="78" r="9" fill="#f97316" opacity="0.8"/> <text x="290" y="72" text-anchor="middle" fill="#fdba74" font-size="8">GPT-5.2</text> <text x="290" y="88" text-anchor="middle" fill="#fdba74" font-size="7">80.0%</text> <circle cx="390" cy="82" r="9" fill="#ec4899" opacity="0.8"/> <text x="390" y="76" text-anchor="middle" fill="#f9a8d4" font-size="8">Sonnet</text> <text x="390" y="92" text-anchor="middle" fill="#f9a8d4" font-size="7">77.2%</text> <circle cx="540" cy="70" r="9" fill="#ef4444" opacity="0.8"/> <text x="540" y="64" text-anchor="middle" fill="#fca5a5" font-size="8">Opus</text> <text x="540" y="80" text-anchor="middle" fill="#fca5a5" font-size="7">80.9%</text> </svg> </div>

The bottom line: Kimi K2.5 and GLM-5 are within 3–4 points of Claude Opus on SWE-Bench Verified — at $3.00–3.20/M output vs $25.00/M output. That's a 7–8x cost reduction for ~96% of the coding capability. Both are fully MIT-licensed with open weights.
Part 3: The Math — Cost Modeling
Token Economics Primer
A typical OpenClaw conversation turn consumes:
- System prompt + memory context: ~2,000 tokens (input)
- User message: ~100 tokens (input)
- Tool calls + results: ~500 tokens (input/output)
- Agent response: ~300 tokens (output)
Per-turn total: ~2,600 input + ~800 output tokens
Cost Per Turn by Provider (Including New Models)
| Provider | Model | Cost/Turn | 100 turns/day | 30-day cost |
|---|---|---|---|---|
| Z.ai | GLM-4.7-Flash | $0.000000 | $0.000 | $0.00 |
| Together AI | Gemma 3n E4B | $0.000084 | $0.008 | $0.25 |
| Groq | Llama 3.1 8B | $0.000194 | $0.019 | $0.58 |
| Cerebras | Llama 3.1 8B | $0.000340 | $0.034 | $1.02 |
| Groq | GPT-OSS-120B | $0.000870 | $0.087 | $2.61 |
| OpenRouter | DeepSeek V3.2 | $0.000954 | $0.095 | $2.87 |
| Groq | Llama 3.3 70B | $0.002166 | $0.217 | $6.50 |
| OpenRouter | Kimi K2.5 | $0.002970 | $0.297 | $8.91 |
| Z.ai | GLM-5 | $0.005160 | $0.516 | $15.48 |
| Direct | Claude Sonnet 4.5 | $0.019800 | $1.980 | $59.40 |
| Direct | Claude Opus 4.6 | $0.033000 | $3.300 | $99.00 |
Formula: Cost/turn = (input_tokens × input_price/1M) + (output_tokens × output_price/1M)
Kimi K2.5 (OpenRouter): (2,600 × $0.45/1M) + (800 × $2.25/1M) = $0.001170 + $0.001800 = $0.002970
GLM-5 (Z.ai): (2,600 × $1.00/1M) + (800 × $3.20/1M) = $0.002600 + $0.002560 = $0.005160
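The formula above is a one-liner in code. A quick sketch that reproduces the Kimi K2.5 and GLM-5 worked examples (prices are $ per 1M tokens, taken from the pricing tables above):

```python
def cost_per_turn(input_tokens, output_tokens, input_price, output_price):
    """Cost of one conversation turn; prices are in $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Typical OpenClaw turn: ~2,600 input + ~800 output tokens
kimi = cost_per_turn(2_600, 800, 0.45, 2.25)  # Kimi K2.5 via OpenRouter
glm5 = cost_per_turn(2_600, 800, 1.00, 3.20)  # GLM-5 via Z.ai
print(f"Kimi K2.5: ${kimi:.6f}/turn, GLM-5: ${glm5:.6f}/turn")
```

Plugging in any row from the frontier pricing table gives the corresponding cost-per-turn entry.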
Monthly Cost Projection — Visual Scaling
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 340" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="340" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">Monthly Cost by Daily Message Volume</text> <text x="50" y="62" text-anchor="end" fill="#94a3b8" font-size="10">$600</text> <text x="50" y="102" text-anchor="end" fill="#94a3b8" font-size="10">$450</text> <text x="50" y="142" text-anchor="end" fill="#94a3b8" font-size="10">$300</text> <text x="50" y="182" text-anchor="end" fill="#94a3b8" font-size="10">$150</text> <text x="50" y="222" text-anchor="end" fill="#94a3b8" font-size="10">$50</text> <text x="50" y="262" text-anchor="end" fill="#94a3b8" font-size="10">$0</text> <line x1="60" y1="58" x2="580" y2="58" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="98" x2="580" y2="98" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="138" x2="580" y2="138" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="178" x2="580" y2="178" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="218" x2="580" y2="218" stroke="#2a2a4a" stroke-width="0.3"/> <line x1="60" y1="258" x2="580" y2="258" stroke="#94a3b8" stroke-width="0.5"/> <text x="100" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">25/day</text> <text x="204" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">100/day</text> <text x="308" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">250/day</text> <text x="412" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">500/day</text> <text x="516" y="276" text-anchor="middle" fill="#94a3b8" font-size="10">1000/day</text> <polyline points="100,257 204,257 308,256 412,255 516,254" fill="none" stroke="#10b981" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,255 204,249 308,236 412,214 516,170" fill="none" 
stroke="#06b6d4" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,253 204,241 308,215 412,172 516,86" fill="none" stroke="#8b5cf6" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,243 204,218 308,178 412,98 516,58" fill="none" stroke="#ec4899" stroke-width="2.5" stroke-linecap="round"/> <polyline points="100,225 204,178 308,98 412,58 516,58" fill="none" stroke="#ef4444" stroke-width="2.5" stroke-linecap="round" stroke-dasharray="6,3"/> <circle cx="100" cy="257" r="3" fill="#10b981"/><circle cx="204" cy="257" r="3" fill="#10b981"/><circle cx="308" cy="256" r="3" fill="#10b981"/><circle cx="412" cy="255" r="3" fill="#10b981"/><circle cx="516" cy="254" r="3" fill="#10b981"/> <circle cx="100" cy="255" r="3" fill="#06b6d4"/><circle cx="204" cy="249" r="3" fill="#06b6d4"/><circle cx="308" cy="236" r="3" fill="#06b6d4"/><circle cx="412" cy="214" r="3" fill="#06b6d4"/><circle cx="516" cy="170" r="3" fill="#06b6d4"/> <circle cx="100" cy="253" r="3" fill="#8b5cf6"/><circle cx="204" cy="241" r="3" fill="#8b5cf6"/><circle cx="308" cy="215" r="3" fill="#8b5cf6"/><circle cx="412" cy="172" r="3" fill="#8b5cf6"/><circle cx="516" cy="86" r="3" fill="#8b5cf6"/> <circle cx="100" cy="243" r="3" fill="#ec4899"/><circle cx="204" cy="218" r="3" fill="#ec4899"/><circle cx="308" cy="178" r="3" fill="#ec4899"/><circle cx="412" cy="98" r="3" fill="#ec4899"/> <rect x="90" y="290" width="12" height="3" rx="1" fill="#10b981"/><text x="108" y="294" fill="#94a3b8" font-size="9">Groq 8B ($0.58)</text> <rect x="195" y="290" width="12" height="3" rx="1" fill="#06b6d4"/><text x="213" y="294" fill="#94a3b8" font-size="9">Kimi K2.5 ($8.91)</text> <rect x="310" y="290" width="12" height="3" rx="1" fill="#8b5cf6"/><text x="328" y="294" fill="#94a3b8" font-size="9">GLM-5 ($15.48)</text> <rect x="405" y="290" width="12" height="3" rx="1" fill="#ec4899"/><text x="423" y="294" fill="#94a3b8" font-size="9">Sonnet ($59)</text> <rect x="495" y="290" width="12" height="3" rx="1" 
fill="#ef4444"/><text x="513" y="294" fill="#94a3b8" font-size="9">Opus ($99)</text> <text x="310" y="332" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Monthly cost at 100 msgs/day reference point. Opus/Sonnet scale exceeds chart at high volumes.</text> </svg> </div>

| Daily Messages | Groq 8B | Kimi K2.5 | GLM-5 | GLM-4.7-Flash | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|---|---|---|
| 25 | $0.15 | $2.23 | $3.87 | $0.00 | $14.85 | $24.75 |
| 100 | $0.58 | $8.91 | $15.48 | $0.00 | $59.40 | $99.00 |
| 250 | $1.46 | $22.28 | $38.70 | $0.00 | $148.50 | $247.50 |
| 500 | $2.91 | $44.55 | $77.40 | $0.00 | $297.00 | $495.00 |
| 1,000 | $5.82 | $89.10 | $154.80 | $0.00 | $594.00 | $990.00 |
The standout: GLM-4.7-Flash is genuinely free (rate-limited) with 200K context. For agent prototyping and high-volume tool-calling, this is unbeatable at $0.
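The volume table above is straight multiplication of per-turn cost by daily volume; a quick sanity-check sketch (per-turn figures come from the cost-per-turn table earlier in this part):

```python
def monthly_cost(cost_per_turn, msgs_per_day, days=30):
    """Project a monthly bill from per-turn cost and daily message volume."""
    return cost_per_turn * msgs_per_day * days

for name, per_turn in [("Kimi K2.5", 0.00297), ("GLM-5", 0.00516), ("Opus 4.6", 0.033)]:
    print(f"{name} at 100 msgs/day: ${monthly_cost(per_turn, 100):.2f}/month")
```

Costs scale linearly with volume, which is why model choice dominates every other knob at high message counts.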
Part 4: Smart Routing — The 78% Cost Reduction
The Problem
A naive OpenClaw config sends every request — including "what time is it?" and "check my calendar" — to the same model. If that model is Claude Opus at $25/1M output tokens, you're paying frontier prices for trivial tasks.
ClawRouter Architecture
ClawRouter analyzes each prompt locally (<1ms, zero API calls) using a 14-dimension scoring system and routes to the cheapest capable model [2].
| Tier | % of Requests | Route To | Avg Cost/Turn | Weighted Cost |
|---|---|---|---|---|
| SIMPLE (45%) | Status, math, time | GLM-4.7-Flash | $0.0000 | $0.000000 |
| MODERATE (30%) | Summarize, translate | Kimi K2.5 | $0.0030 | $0.000900 |
| COMPLEX (20%) | Coding, reasoning | GLM-5 or Sonnet | $0.0052 | $0.001040 |
| EXPERT (5%) | Architecture, research | Claude Opus 4.6 | $0.0330 | $0.001650 |
| Blended | 100% | — | — | $0.003590 |
- Without routing (all Claude Opus): $0.033/turn
- With routing (blended, using new models): $0.0036/turn
- Savings: 89.1% (up from 81.7% before Kimi K2.5 and the GLM free tier)
One developer reported their Anthropic bill dropped from $4,660/month to ~$1,400/month using ClawRouter — a 70% reduction — because ~60% of their agent's requests were simple enough for budget models [2].
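ClawRouter's actual 14-dimension scorer is more involved, but the economics follow from tier shares alone. A toy illustration of tier routing and the blended-cost arithmetic above (the keyword heuristic, tier shares, and model assignments are simplified assumptions drawn from the table, not ClawRouter's real logic):

```python
# Tiers modeled on the table above: (share of traffic, model, $/turn)
TIERS = {
    "SIMPLE":   (0.45, "glm-4.7-flash",   0.0000),
    "MODERATE": (0.30, "kimi-k2.5",       0.0030),
    "COMPLEX":  (0.20, "glm-5",           0.0052),
    "EXPERT":   (0.05, "claude-opus-4.6", 0.0330),
}

def route(prompt: str) -> str:
    """Toy local classifier: crude keyword heuristics standing in for
    ClawRouter's multi-dimension scoring."""
    p = prompt.lower()
    if any(k in p for k in ("architecture", "research", "design a system")):
        return "EXPERT"
    if any(k in p for k in ("code", "debug", "prove", "refactor")):
        return "COMPLEX"
    if any(k in p for k in ("summarize", "translate", "rewrite")):
        return "MODERATE"
    return "SIMPLE"

blended = sum(share * cost for share, _, cost in TIERS.values())
savings = 1 - blended / 0.033  # vs. sending everything to Claude Opus
print(f"blended ${blended:.6f}/turn, savings {savings:.1%}")
```

The point of local (client-side) routing is that this classification costs zero API calls and sub-millisecond latency, so the savings are pure.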
OpenRouter Alternative
OpenRouter offers similar capability through its Auto Model feature [4]:
| Feature | OpenRouter | ClawRouter |
|---|---|---|
| Routing location | Server-side | Local (<1ms) |
| BYOK support | Yes (1M free req/month) | N/A |
| Fee structure | 0% markup + 5% BYOK after 1M | Free (open-source) |
| Model access | 400+ models (incl. Kimi K2.5, GLM-5) | Configure your own |
| Failover | Automatic (50+ providers) | Manual config |
Part 5: Local Hardware — M3 Ultra Mac Studio
Why Consider Local Inference?
Cloud APIs win on per-token cost for most workloads. But local hardware makes sense when: (1) privacy/data sovereignty is non-negotiable, (2) you need offline access, (3) you want to run massive 400B+ models, or (4) you're deploying fine-tuned models not available via API.
M3 Ultra Mac Studio Specifications
| Spec | M3 Ultra (32C/80G) |
|---|---|
| CPU | 32-core (24P + 8E) |
| GPU | 80-core |
| Max Unified Memory | 512GB LPDDR5x |
| Memory Bandwidth | 819 GB/s |
| TDP | ~100W under load |
| Price (256GB) | $6,999 |
| Price (512GB) | $9,499 |
Key advantage: 512GB of unified memory in a single box — enough to run DeepSeek V3 671B quantized. An equivalent GPU setup (8x H100) would cost $200K+.
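Whether a model fits in unified memory reduces to a back-of-envelope estimate: quantized weights take roughly params × bits / 8, plus headroom for the KV cache and runtime. A rough estimator (the 20% overhead factor is an assumption for illustration, not a measured figure):

```python
def est_ram_gb(params_b: float, bits: int, overhead: float = 0.20) -> float:
    """Rough RAM estimate in GB: quantized weight size plus a fudge factor
    for KV cache and runtime overhead (overhead=0.20 is an assumption)."""
    weights_gb = params_b * bits / 8  # e.g. 671B at 4-bit ~= 335 GB of weights
    return weights_gb * (1 + overhead)

print(round(est_ram_gb(671, 4)))  # DeepSeek V3 671B, 4-bit: in the ballpark of the table's ~405GB
print(round(est_ram_gb(8, 4)))   # Llama 3.1 8B, 4-bit: matches the table's ~5GB
```

This also shows why 512GB is the magic number: a 4-bit 671B model clears it with room for context, while anything smaller than ~400GB of RAM cannot hold it at all.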
Tokens/Second Benchmarks (MLX Framework)
<div style="max-width:620px;margin:2rem auto;"> <svg viewBox="0 0 620 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;font-family:ui-monospace,monospace;"> <rect width="620" height="380" rx="12" fill="#1a1a2e" stroke="#2a2a4a" stroke-width="1"/> <text x="310" y="28" text-anchor="middle" fill="#e2e8f0" font-size="13" font-weight="600">M3 Ultra 512GB: Generation Speed by Model (MLX)</text> <text x="130" y="62" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 1B (Q4)</text> <rect x="140" y="49" width="395" height="20" rx="3" fill="#10b981" opacity="0.85"/> <text x="545" y="63" fill="#6ee7b7" font-size="10" font-weight="600">237 t/s</text> <text x="130" y="90" text-anchor="end" fill="#e2e8f0" font-size="10">Gemma 3 4B (Q4)</text> <rect x="140" y="77" width="223" height="20" rx="3" fill="#10b981" opacity="0.75"/> <text x="373" y="91" fill="#6ee7b7" font-size="10" font-weight="600">134 t/s</text> <text x="130" y="118" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.1 8B (4-bit)</text> <rect x="140" y="105" width="217" height="20" rx="3" fill="#3b82f6" opacity="0.85"/> <text x="367" y="119" fill="#93c5fd" font-size="10" font-weight="600">130 t/s</text> <text x="130" y="146" text-anchor="end" fill="#e2e8f0" font-size="10">QwQ 32B (4-bit)</text> <rect x="140" y="133" width="58" height="20" rx="3" fill="#f59e0b" opacity="0.85"/> <text x="208" y="147" fill="#fcd34d" font-size="10" font-weight="600">35 t/s</text> <text x="130" y="174" text-anchor="end" fill="#e2e8f0" font-size="10">Qwen3 235B (FP8)</text> <rect x="140" y="161" width="50" height="20" rx="3" fill="#8b5cf6" opacity="0.85"/> <text x="200" y="175" fill="#c4b5fd" font-size="10" font-weight="600">30 t/s</text> <text x="130" y="202" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek V3 671B</text> <rect x="140" y="189" width="35" height="20" rx="3" fill="#ec4899" opacity="0.85"/> <text x="185" y="203" fill="#f9a8d4" font-size="10" font-weight="600">21 t/s</text> <text x="130" 
y="230" text-anchor="end" fill="#e2e8f0" font-size="10">DeepSeek R1 671B</text> <rect x="140" y="217" width="33" height="20" rx="3" fill="#ec4899" opacity="0.75"/> <text x="183" y="231" fill="#f9a8d4" font-size="10" font-weight="600">20 t/s</text> <text x="130" y="258" text-anchor="end" fill="#e2e8f0" font-size="10">Llama 3.3 70B (Q4)</text> <rect x="140" y="245" width="28" height="20" rx="3" fill="#3b82f6" opacity="0.75"/> <text x="178" y="259" fill="#93c5fd" font-size="10" font-weight="600">17 t/s</text> <text x="130" y="286" text-anchor="end" fill="#e2e8f0" font-size="10">GLM-4.7 358B (Q3)</text> <rect x="140" y="273" width="25" height="20" rx="3" fill="#8b5cf6" opacity="0.75"/> <text x="175" y="287" fill="#c4b5fd" font-size="10" font-weight="600">15 t/s</text> <text x="310" y="318" text-anchor="middle" fill="#94a3b8" font-size="10">MoE models (DeepSeek, Qwen3 235B) outperform their size class</text> <text x="310" y="334" text-anchor="middle" fill="#94a3b8" font-size="10">because only active experts (~37B) need memory bandwidth per token</text> <text x="310" y="360" text-anchor="middle" fill="#64748b" font-size="9" font-style="italic">Sources: Hardware Corner, Lattice, Creative Strategies, MacStories — Feb 2026</text> </svg> </div>

| Model | Size | Quant | tok/s (gen) | RAM Needed |
|---|---|---|---|---|
| Gemma 3 1B | 1B | Q4 | 237 | <4GB |
| Llama 3.1 8B | 8B | 4-bit | 130 | ~5GB |
| QwQ 32B | 32B | 4-bit | 35 | ~20GB |
| Qwen3 235B (MoE) | 235B | FP8 | 30 | ~256GB |
| DeepSeek V3 (MoE) | 671B | 4-bit | 21 | ~405GB |
| DeepSeek R1 (MoE) | 671B | 4-bit | 20 | ~405GB |
| Llama 3.3 70B | 70B | Q4_K_M | 17 | ~40GB |
| GLM-4.7 358B (MoE) | 358B | Q3 | ~15 | ~256GB |
Critical context: DeepSeek V3's generation speed drops from 21 tok/s with a 69-token prompt to 5.8 tok/s at 16K tokens of context, because the growing KV cache competes with the weights for memory bandwidth. Plan for 10–15 tok/s at realistic conversation lengths.
Framework Choice Matters
| Framework | Speed vs MLX | Best For |
|---|---|---|
| MLX (Apple native) | Baseline (fastest) | Maximum Apple Silicon performance |
| LM Studio (MLX backend) | ~Same | GUI + ease of use |
| llama.cpp | 20–50% slower | Cross-platform, broader model support |
| Ollama | 20–40% slower | Easiest setup, REST API |
Use MLX for best performance. For DeepSeek V3 671B: MLX achieved 21 tok/s vs llama.cpp's 6.2 tok/s — a 3.4x difference [11].
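For the "easiest setup" path, Ollama exposes a local REST endpoint that OpenClaw (or any client) can call. A minimal stdlib-only sketch against Ollama's default local endpoint (the model name is an example; this assumes an Ollama server is running):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the model's text response."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.1:8b", "Say hi in five words.")  # requires a running Ollama server
```

Swapping the backend framework changes only this transport layer; the OpenClaw configuration on top stays the same.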
Cloud API vs. Local Hardware Break-Even
| Usage Pattern | M3 Ultra 512GB ($/M tokens) | Cheapest Cloud (DeepInfra) | Winner |
|---|---|---|---|
| Light (2M tok/day) | $3.45/M | $0.30/M | Cloud wins 11x |
| Medium (10M tok/day) | $0.69/M | $0.30/M | Cloud wins 2.3x |
| Heavy 24/7 | $5.30/M | $0.30/M | Cloud wins |
| Privacy/offline | Priceless | N/A | Local wins |
| Run 671B models | $0.69–5.30/M | $3.00–25.00/M (API) | Local wins |
The uncomfortable truth: At current cloud API pricing, local inference rarely breaks even on pure cost. Cloud providers benefit from massive batch sizes, FP8 hardware optimization, and economies of scale. The M3 Ultra's value proposition is the 512GB unified memory pool — running models that would require $200K in GPU hardware otherwise.
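The amortized figures in the break-even table reduce to simple division. A sketch using the $207/month amortized figure from the summary table (that figure is assumed to fold in electricity):

```python
def local_cost_per_m(amortized_monthly_usd: float, tokens_per_day_m: float, days: int = 30) -> float:
    """Amortized local inference cost in $ per 1M tokens."""
    return amortized_monthly_usd / (tokens_per_day_m * days)

for vol in (2, 10):  # light and medium usage patterns from the table
    print(f"{vol}M tok/day -> ${local_cost_per_m(207, vol):.2f}/M")
```

The asymmetry is visible immediately: local cost per token falls with utilization, while cloud cost per token is flat, so local only approaches parity near saturation.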
M5 Ultra (expected late 2026): ~1,100 GB/s bandwidth (+34%), ~768GB max RAM, 3–4x faster prompt processing. Worth waiting for if not urgent [12].
Part 6: Security Cost — The Hidden Variable
OpenClaw's permissionless architecture introduces non-trivial security costs [8]:
| Incident | Date | Impact |
|---|---|---|
| 40,000+ exposed instances | Jan 2026 | 93.4% had auth bypass flaws |
| Moltbook data leak | Jan 2026 | 1.5M API tokens, 35K emails exposed |
| Scam token via hijacked agent | Jan 2026 | $16M in losses |
| Prompt injection backdoors | Ongoing | Agents execute attacker instructions |
Mitigation cost considerations:
| Security Measure | Implementation | Cost Impact |
|---|---|---|
| Authentication (required) | Gateway token auth | +$0 (config change) |
| Docker sandboxing | Per-session containers | +10–20% memory overhead |
| Prompt injection defense | Latest-gen models only | +$10–50/month (use Claude over Llama) |
| NanoClaw | Hardened fork | $0 (open-source) |
Conclusion & Updated Recommendation
Cost-Optimized OpenClaw Stack for Optimal (February 2026)
| Layer | Choice | Monthly Cost |
|---|---|---|
| Hosting | Hetzner CX22 VPS | $4.15 |
| Agent Platform | OpenClaw (MIT license) | $0 |
| LLM Router | OpenRouter (BYOK, 1M free req/mo) | $0 |
| Simple tasks (45%) | GLM-4.7-Flash (free tier) | $0 |
| Moderate tasks (30%) | Kimi K2.5 via OpenRouter | ~$5–10 |
| Complex tasks (20%) | GLM-5 or Claude Sonnet 4.5 | ~$5–15 |
| Expert tasks (5%) | Claude Opus 4.6 via BYOK | ~$3–8 |
| Security | NanoClaw fork + Docker sandboxing | $0 |
| Total projected | — | $17–37/month |
Decision Matrix (Updated)
| If you need... | Choose... | Why |
|---|---|---|
| Fastest inference | Cerebras | 2,988 tok/s, 3x faster than #2 |
| Lowest latency | Fireworks AI | 0.17s TTFT |
| Best free tier | GLM-4.7-Flash | Free, 200K context, genuinely capable |
| Best open-weight frontier | GLM-5 or Kimi K2.5 | MIT license, 96% of Opus quality, 7–8x cheaper |
| Best agentic model | Kimi K2.5 (Swarm mode) | BrowseComp 78.4%, HLE-tools 50.2% |
| Best cost/performance | DeepSeek V3.2 | $0.38/M output, solid for moderate tasks |
| Maximum cost control | ClawRouter + multi-model | 89% savings via smart routing |
| Run 671B models locally | M3 Ultra 512GB | Only $9.5K box that fits DeepSeek V3 |
| Zero spend | Oracle Free + Ollama | $0/month, limited capability |
| Regulatory safety | Avoid Z.ai direct | Use GLM models via OpenRouter/Fireworks |
This report was generated on February 12, 2026 using parallel AI research agents (Claude, Anthropic). Updated with Kimi K2.5, GLM-5, and M3 Ultra Mac Studio benchmarks on the same date. All claims are hyperlinked to their sources. This is not financial advice.
Automated report produced for Optimal | Technology Category
Sources & References
<a id="ref-1"></a>[1] OpenClaw Official Documentation — Architecture, deployment guides, and hardware requirements.
<a id="ref-2"></a>[2] ClawRouter — Smart LLM Router — Open-source routing layer claiming 78% cost savings. See also: ClawRouter: How I Cut My $4,660 Bill by 70% (Medium).
<a id="ref-3"></a>[3] Open Source AI API Providers: Speed, Cost & Performance Compared (2026) — Independent benchmark of GPT-OSS-120B across 6 providers.
<a id="ref-4"></a>[4] OpenRouter Pricing & BYOK Documentation — 400+ models, 1M free BYOK requests/month. See also: OpenRouter BYOK Announcement.
<a id="ref-5"></a>[5] OpenClaw Deploy Cost Guide by WenHao Yu — Comprehensive hosting cost analysis ($0–8/month configurations).
<a id="ref-6"></a>[6] Z.ai (Zhipu AI) Wikipedia — Company background, HKEX listing, U.S. Entity List status.
<a id="ref-7"></a>[7] Contemplating Local LLMs vs OpenRouter and Z.ai — First-hand speed testing (20–30 tok/s on Z.ai direct).
<a id="ref-8"></a>[8] CrowdStrike: What Security Teams Need to Know About OpenClaw — Security analysis. See also: Infosecurity Magazine, Cisco Blog, Trend Micro.
<a id="ref-9"></a>[9] Kimi K2.5 on Hugging Face — Model card, benchmarks, architecture. See also: Artificial Analysis: Kimi K2.5, eesel.ai Pricing Guide, VentureBeat: K2.5 and Agent Swarms.
<a id="ref-10"></a>[10] GLM-5 on Hugging Face — 744B MoE open weights, MIT license. See also: Bloomberg: China's Zhipu Unveils New AI Model, SCMP: GLM-5 Launch, Artificial Analysis: GLM-5, Simon Willison: GLM-5 From Vibe Coding to Agentic Engineering.
<a id="ref-11"></a>[11] Hardware Corner: DeepSeek V3 on Mac Studio M3 Ultra — Comprehensive benchmarks. See also: Lattice: M3 Ultra Performance Benchmarks, Creative Strategies: Mac Studio M3 Ultra AI Review, MacStories: Testing DeepSeek R1 on M3 Ultra.
<a id="ref-12"></a>[12] Macworld: 2026 Mac Studio M5 Ultra Predictions — M4 Ultra confirmed skipped, M5 Ultra expected H2 2026 with ~1,100 GB/s bandwidth.
Additional Sources:
- Cerebras Pricing — Model speeds and per-token costs.
- Together AI Pricing — Full model catalog.
- Fireworks AI Pricing — Serverless and on-demand pricing.
- Groq Pricing — LPU-based inference, batch discounts.
- DeepSeek API Pricing — Cache hit/miss pricing tiers.
- Z.ai Official Pricing — Full GLM model catalog.
- OpenRouter: Kimi K2.5 — Provider pricing comparison.
- OpenRouter: GLM-5 — Western access pricing.
- Apple M3 Ultra Specs — Official hardware specifications.
- PricePerToken.com — LLM API pricing comparison (300+ models).
- OpenRouter State of AI — Market share and usage trends.