Frontier MoE sleep/wake at TP=4 on consumer Blackwell

Sleep/wake rotation works fine for small dense models on a single GPU — load weights into pinned CPU RAM, swap them back to VRAM in under a second. That’s not the interesting case. The interesting case is frontier MoE at TP=4: ~17 GiB of sharded weights per GPU, sparse routing tables, KV cache across four ranks, and a co-resident peer model on the same physical cards. For months that combination on consumer Blackwell either OOM’d, hung in cuMemMap, or produced silent garbage outputs after a few rotation cycles.

It now works. /wake_up: ~2 seconds. Cross-peer swap (sleep model A, wake model B on the same 4 GPUs): ~4.5 seconds, down from ~50. Live in production on 4× RTX PRO 6000 Blackwell with DeepSeek-V4-Flash and MiMo-V2.5-Flash sharing the same TP=4 pool. This is what it took: a specific image stack, two cherry-picked upstream PRs, and one non-obvious config knob.

The bundle image is on GHCR — docker pull ghcr.io/doradusresearch/vllm-blackwell-sm12x-bundle:v4. Source repo is being prepared for clean re-release; in the meantime the image is sufficient to reproduce the result. The SM12x workspace-shrink patches are upstreamed as vllm#42856 (on top of #41834).

The setup

4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition. 95 GiB per GPU. SM_120 (consumer Blackwell, no Fabric/RDMA). Goal: run a 2-model rotation pool where one is awake serving inference and the other is /sleep’d (weights copied to pinned CPU memory) so they share the same GPUs without doubling hardware.

In vLLM terms: --enable-sleep-mode with level-1 sleep. Documented use case, well-tested on H100/H200.

On consumer Blackwell, in our experience: an obstacle course.

The live spec for the two co-resident models:

Setting	DSv4-Flash	MiMo-V2.5-Flash
`--tensor-parallel-size`	4	4
`--max-model-len`	131072 (128K)	65536 (64K)
`--max-num-seqs`	12	6
`--max-num-batched-tokens`	8192	8192
`--gpu-memory-utilization`	0.60	0.85
`--enable-sleep-mode`	✓	✓

The --gpu-memory-utilization asymmetry is intentional and important. DSv4 has to run at 0.60 because of the PR #41834 workspace footprint + measured ~9 GiB-per-GPU sleep residue from the co-resident MiMo (covered below — and yes, that residue is higher than vLLM’s docs assume). MiMo has no equivalent non-cumem state and can stay at the default 0.85. Both share the same physical 4 cards; they just budget the cumem allocator differently.
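As a rough illustration of the table above, the two launch specs can be written down as data and turned into `vllm serve` argument lists. This is a minimal sketch: the flag names and values come from the table, while the `PoolMember` class, the `serve_args` helper, and the model paths are hypothetical placeholders rather than anything shipped in the bundle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolMember:
    """One co-resident model in the rotation pool (values from the spec table)."""
    model: str                      # placeholder path, not taken from the post
    max_model_len: int
    max_num_seqs: int
    gpu_memory_utilization: float
    tensor_parallel_size: int = 4
    max_num_batched_tokens: int = 8192


def serve_args(m: PoolMember) -> list:
    """Build the `vllm serve` argument list for one pool member."""
    return [
        "vllm", "serve", m.model,
        "--tensor-parallel-size", str(m.tensor_parallel_size),
        "--max-model-len", str(m.max_model_len),
        "--max-num-seqs", str(m.max_num_seqs),
        "--max-num-batched-tokens", str(m.max_num_batched_tokens),
        "--gpu-memory-utilization", f"{m.gpu_memory_utilization:.2f}",
        "--enable-sleep-mode",
    ]


# Model paths below are hypothetical; substitute your own.
DSV4 = PoolMember("/models/dsv4-flash", 131072, 12, 0.60)
MIMO = PoolMember("/models/mimo-v2.5-flash", 65536, 6, 0.85)

if __name__ == "__main__":
    for member in (DSV4, MIMO):
        print(" ".join(serve_args(member)))
```

Keeping the two specs side by side as data makes the utilization asymmetry explicit and hard to "fix" by accident.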

First-time cold load: budget ~12 minutes per model from docker run to first request. That’s safetensors load (DSv4 is 46 shards, ~14 s/shard average from local NVMe) plus the CUDA-graph capture phase. Image pull adds more on a fresh host (the bundle is ~29 GB). Subsequent restarts on the same host are dominated by the shard load — page-cache hits help but don’t eliminate it. Plan accordingly; if you’re A/B-testing config knobs, restart cost matters.

What broke, in order

1. The b12x community image had cumem state bugs

The closest existing image is the voipmonitor b12x fork, which patches vLLM for SM_120-specific kernel paths. Excellent baseline, but on cross-peer /wake_up we got CUDA Error: invalid argument at cumem_allocator.cpp:145 reliably.

After digging it turned out this is the bug PR #35489 addresses: vLLM’s cumem allocator queries CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED to decide whether to use Fabric handles, and on hardware without Fabric (all consumer Blackwell), that query returns CUDA_ERROR_INVALID_VALUE. The error code gets cached in a global, and the next cuMemMap returns EINVAL because it sees the stale error state.

The fix is a one-line error_code = no_error; reset at the top of create_and_map. The PR has been open since March; it isn’t in any released vLLM image yet.
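The failure pattern is easier to see in a toy model. The sketch below is plain Python, not the allocator's C++; `query_fabric_support` and the `reset_first` switch are hypothetical stand-ins that only mimic the behavior described above (a failed capability query leaves a stale error in a global, and the next mapping call trips over it).

```python
NO_ERROR = 0
INVALID_VALUE = 1  # stand-in for CUDA_ERROR_INVALID_VALUE

error_code = NO_ERROR  # module-level state, like the global described above


def query_fabric_support() -> bool:
    """Toy capability query: fails on hardware without Fabric and records the error."""
    global error_code
    error_code = INVALID_VALUE
    return False


def create_and_map(reset_first: bool) -> bool:
    """Toy mapping call; returns True on success.

    With reset_first=False it trusts whatever is already in error_code, so a
    stale failure from an unrelated earlier call makes it report failure.
    """
    global error_code
    if reset_first:
        error_code = NO_ERROR  # the one-line reset
    # The real mapping work would happen here and set error_code on failure.
    return error_code == NO_ERROR


if __name__ == "__main__":
    query_fabric_support()
    print("without reset:", create_and_map(reset_first=False))
    print("with reset:   ", create_and_map(reset_first=True))
```

The point of the model is that the mapping itself never fails; only the leftover state does.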

2. The cumem cycle leak is real but manageable

Even with #35489 applied, repeated /sleep + /wake_up cycles leak pinned GPU memory. We measured up to ~5 GiB per cycle, eventually overlapping live weight tensors and producing silent garbage outputs around cycle 4-5.

There are about 7 open vLLM issues tracking this (#36651, #21336, #34600, #37111, etc.). PR #34600 adds proper rollback in wake_up when partial allocations fail — important for not leaking on every failed wake even if it doesn’t directly fix the slow per-cycle accumulation. We cherry-pick that too.
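The rollback idea can be sketched generically: map chunks one by one, and if any step fails, undo the ones already mapped before re-raising. This is an illustration of the pattern only, under the assumption that this is roughly what "proper rollback" means here; `FakeDevice` and this `wake_up` function are hypothetical and do not mirror the PR's code.

```python
class FakeDevice:
    """Toy allocator: maps chunk IDs and fails once capacity is reached."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.mapped = set()

    def map(self, chunk_id: int) -> None:
        if len(self.mapped) >= self.capacity:
            raise MemoryError(f"cannot map chunk {chunk_id}")
        self.mapped.add(chunk_id)

    def unmap(self, chunk_id: int) -> None:
        self.mapped.discard(chunk_id)


def wake_up(device: FakeDevice, chunk_ids: list) -> None:
    """Map every chunk or none: undo partial work before re-raising."""
    done = []
    try:
        for cid in chunk_ids:
            device.map(cid)
            done.append(cid)
    except Exception:
        # Roll back in reverse order so nothing stays mapped after a failure.
        for cid in reversed(done):
            device.unmap(cid)
        raise


if __name__ == "__main__":
    dev = FakeDevice(capacity=2)
    try:
        wake_up(dev, [1, 2, 3])  # third map fails
    except MemoryError:
        print("failed wake left mapped:", dev.mapped)
```

Without the `except` branch, every failed wake would strand the chunks it had already mapped.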

3. The PR that fixes everything for DeepSeek-V4 also breaks the memory budget

PR #41834 is the upstream effort to land native DeepSeek-V4 support on SM12x: Triton sparse-MLA fallback, DeepGEMM-free paths, MLA prefix-cache fix. It’s the only credible path to native DSv4 without a community fork, and it makes /sleep synchronous (no more 35-second POST_SLEEP_GAP_S workaround we used to need with the jasl-on-cu130 image).

But PR #41834 also adds about 22 GiB of state per GPU that lives outside the vLLM cumem allocator’s budget: sparse-MLA workspace, marlin scratch, cuda-graph private pools. On H200 (140 GiB per GPU) this is invisible headroom. On consumer Blackwell (95 GiB) it overflows.

At --gpu-memory-utilization 0.85, the cumem budget is 0.85 × 95 GiB ≈ 80 GiB. Add the 22 GiB of non-cumem state and a few GiB for any sleeping co-tenant peer, and the total is ~105 GiB on a 95 GiB GPU. OOM.

4. The 22 GiB pool ignores every config knob you’d reach for

We spent a couple of days trying to shrink the 22 GiB:

--max-cudagraph-capture-size from 12 to 4 to 1: no effect.
cudagraph_mode: PIECEWISE only (vs FULL+PIECEWISE): no effect.
--enforce-eager (disable cuda graphs entirely): no effect. ← the diagnostic.
--max-num-batched-tokens 8192 → 2048: no effect.
--max-num-seqs 12 → 4: no effect.
--max-model-len 131072 → 32768: no effect.

The 22 GiB is hardcoded into PR #41834’s compilation framework. It’s workspace tensors and pre-allocated buffers that the user can’t tune from spec.

The misleading thing: PyTorch’s OOM message labels this “22.7 GiB allocated in private pools (e.g., CUDA Graphs)”. The “(e.g., CUDA Graphs)” sent us chasing graph-capture configuration for days. The label is technically correct but misleading — PyTorch’s “private pools” catches any tensor allocated via a torch.cuda.MemPool, including allocations done outside graph capture. The 22 GiB is workspace tensors, not graphs.
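One way to make the --enforce-eager diagnostic mechanical is to pull the private-pool figure out of the OOM text for two runs (default and eager) and compare them. The sketch below assumes the message contains the phrase quoted above ("... GiB allocated in private pools"); the helper names and the 1 GiB tolerance are hypothetical choices, not part of PyTorch or vLLM.

```python
import re
from typing import Optional

_PRIVATE_POOLS = re.compile(r"([\d.]+)\s*(KiB|MiB|GiB) allocated in private pools")
_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}


def private_pool_gib(oom_message: str) -> Optional[float]:
    """Extract the 'allocated in private pools' figure in GiB, or None if absent."""
    m = _PRIVATE_POOLS.search(oom_message)
    if m is None:
        return None
    return float(m.group(1)) * _TO_GIB[m.group(2)]


def graphs_are_the_cause(pool_gib_default: float, pool_gib_eager: float,
                         tol_gib: float = 1.0) -> bool:
    """True if disabling CUDA graphs shrank the private pools by more than tol_gib."""
    return (pool_gib_default - pool_gib_eager) > tol_gib


if __name__ == "__main__":
    # Fragment as quoted in this post; the rest of the message is omitted.
    fragment = "22.7 GiB allocated in private pools (e.g., CUDA Graphs)"
    pool = private_pool_gib(fragment)
    # In our case the eager run showed the same footprint, so graphs were not it.
    print(pool, graphs_are_the_cause(pool, pool))
```

If the eager run reports roughly the same private-pool size, graph-capture knobs are the wrong place to look.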

5. The actual fix is operational, not source-level

Once we understood that the 22 GiB is a hardcoded non-cumem footprint, the fix becomes obvious — for DSv4 specifically:

# DSv4-Flash (sparse-MLA via PR #41834):
--gpu-memory-utilization 0.60   # not 0.85, not even 0.70

# MiMo-V2.5-Flash (dense MLA, no sparse-MLA workspace):
--gpu-memory-utilization 0.85   # default is fine

The knob is DSv4-specific. MiMo doesn’t carry the PR #41834 workspace footprint, so its cumem budget can stay at the default. The rotation pool budgets each model independently — they don’t have to match.

Why 0.60 and not 0.70 (correction from an earlier draft of this post): the first cut tried 0.70, reasoning that 0.70 × 95 = 66.5 GiB cumem + 22 GiB non-cumem + ~4 GiB residue ≈ 92 of 95 GiB. It crashed at startup with “CUDA out of memory. Tried to allocate 1.56 GiB. GPU 0 has... 1.33 GiB free.” Measured live via nvidia-smi after the crash, MiMo’s sleeping-peer residue is ~9 GiB per GPU, not the ~4 GiB the vLLM design assumes. So the real math is 66.5 + 22 + 9 ≈ 97.5 of 95 GiB, which overflows. Dropping to 0.60 gives 57 + 22 + 9 = 88 of 95 GiB, a ~7 GiB margin. It boots cleanly and holds across the context sweep (up to ~100K input tokens); see the throughput section below.

Four weeks. Cumem allocator bugs. cuMemMap race conditions. PR #35489. PR #34600. PR #41834. Sparse-MLA workspace archaeology. PyTorch private-pool misattribution. And the final unlock was one config knob.

0.60 × 95 GiB = 57 GiB cumem budget. Plus 22 GiB non-cumem. Plus ~9 GiB measured sleeping-peer residue. Total: ~88 GiB on a 95 GiB GPU. ~7 GiB margin. Fits cleanly. Keep an eye on residue if you stack a third co-resident — the budget tightens fast.
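The budget arithmetic is simple enough to keep as a small calculator, which helps when a third co-resident or a different card changes the inputs. This is a sketch using the figures from this post as defaults (95 GiB per GPU, 22 GiB non-cumem, 9 GiB measured peer residue); the function names are hypothetical.

```python
def per_gpu_budget(util: float, total_gib: float = 95.0,
                   non_cumem_gib: float = 22.0,
                   peer_residue_gib: float = 9.0) -> dict:
    """Per-GPU memory budget for a given --gpu-memory-utilization value."""
    cumem = util * total_gib
    used = cumem + non_cumem_gib + peer_residue_gib
    return {"cumem_gib": cumem, "used_gib": used, "margin_gib": total_gib - used}


def max_utilization(total_gib: float = 95.0, non_cumem_gib: float = 22.0,
                    peer_residue_gib: float = 9.0,
                    margin_gib: float = 0.0) -> float:
    """Largest utilization that still leaves margin_gib free on the card."""
    return (total_gib - non_cumem_gib - peer_residue_gib - margin_gib) / total_gib


if __name__ == "__main__":
    for util in (0.85, 0.70, 0.60):
        b = per_gpu_budget(util)
        print(f"util={util:.2f} cumem={b['cumem_gib']:.1f} "
              f"used={b['used_gib']:.1f} margin={b['margin_gib']:+.1f}")
    print(f"break-even utilization: {max_utilization():.3f}")
```

With these inputs the break-even point falls between 0.60 and 0.70, which matches what we saw: 0.70 crashed, 0.60 boots.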

The cost: the cumem budget is smaller, so there is less room for weights + KV cache. DeepSeek-V4 weights at TP=4 are ~17 GiB per GPU; KV cache for max_model_len=131072, max_num_seqs=12 at FP8 is ~7 GiB. Total ~24 GiB, comfortably inside the 57 GiB budget.

Measured outcome

Operation	Before (jasl-on-cu130 image)	After (this stack)
`/sleep`	Async, returns 200 in ~5s, actual unmap takes another 35s	Synchronous, ~2.5s (returns when actually done; range 1.8–3.0s)
`/wake_up` (cross-peer, after `/sleep`)	Needed 35s `POST_SLEEP_GAP_S` workaround to avoid `cuMemMap` race	~2s (range 1.5–2.6s), no workaround
Cross-peer swap total	~50s	~4.5s (range 4.4–4.5s across 3 live cycles)

4.5 seconds is the headline number. For a rotation pool where one user request can trigger a cross-peer swap, 50s → 4.5s is the difference between “annoying” and “invisible.”
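For reference, a cross-peer swap with per-step timing can be driven from a few lines of client code. This is a sketch under assumptions: the base URLs are placeholders, the `/sleep?level=1` query form is our reading of how the level is passed, and it relies on /sleep being synchronous as it is on this stack. The `post` parameter is injectable so the logic can be exercised without a live server.

```python
import time
import urllib.request
from typing import Callable


def http_post(url: str, timeout_s: float = 120.0) -> int:
    """POST with an empty body and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def swap(sleeper_url: str, waker_url: str,
         post: Callable[[str], int] = http_post) -> dict:
    """Sleep one model, then wake the other, timing each step."""
    t0 = time.perf_counter()
    if post(f"{sleeper_url}/sleep?level=1") != 200:
        raise RuntimeError("sleep failed")
    t1 = time.perf_counter()
    if post(f"{waker_url}/wake_up") != 200:
        raise RuntimeError("wake_up failed")
    t2 = time.perf_counter()
    return {"sleep_s": t1 - t0, "wake_s": t2 - t1, "total_s": t2 - t0}


if __name__ == "__main__":
    # Hypothetical endpoints for the two co-resident servers.
    print(swap("http://localhost:8001", "http://localhost:8002"))
```

On an image where /sleep returns before the unmap finishes, this sequence would need a gap between the two calls, which is exactly the workaround the "before" column describes.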

What’s still slow: decode TPS is PCIe-sync-bound

Rotation works. Steady-state decode TPS is the next bottleneck. Profiling DeepSeek-V4-Flash decode at TP=4 on this stack (cumulative across a representative workload):

Op	Time
`sparse_accumulate_indexed_attention`	117.9 s
`nccl_allreduce`	55.1 s

Every decode step does a cross-GPU reduction over partial attention results. At TP=4 with sparse-MLA, the per-step PCIe round-trip is dominating wall-time — nccl_allreduce is pure inter-rank communication, and a sizable chunk of sparse_accumulate_indexed_attention is waiting on the next all-reduce to complete. Compute isn’t the cap; PCIe round-trip latency between ranks is.

The implication for TPS: for DSv4-class sparse-MLA models, single-stream decode at TP=4 on this stack is communication-starved. Higher concurrency helps (batched serving amortizes the per-step round-trip across many requests). If your workload is latency-sensitive single-stream decode of DSv4 specifically, this isn’t the right stack — pick a topology with a better-bandwidth interconnect for that one workload.

Decode TPS across context length (DSv4-Flash, TP=4, single-stream)

Live sweep against the production rotation pool. 7 input sizes from ~500 to ~100K tokens. Each request set max_tokens=256 and temperature=0, with a prompt that forces a structured ~250-token response so that decode dominates the wall time. MiMo was sleeping for every measurement (the same condition as the rotation-pool use case).

Input tokens	TTFT (s)	Decode TPS	E2E (s)
477	0.54	69.7	4.2
1,724	0.68	68.9	4.4
6,669	2.09	66.0	6.0
26,535	9.37	58.6	13.7
52,980	17.34	51.8	22.3
79,468	23.06	45.8	28.6
99,291	21.12	41.9	27.2

Decode degrades from ~70 → ~42 tok/s across the tested range (~500 to ~100K input tokens). That’s the PCIe-sync curve in numbers: more KV-cache state to all-reduce per step, longer per-step round-trip. TTFT scales roughly linearly with prefill (~2.8-3.4K tokens/sec prefill throughput in the 26K-79K rows); the ~99K row’s slightly lower TTFT vs the ~79K row is most likely prefix-cache hits on the repeating filler content, not a measurement error.
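The columns in the table are related by simple arithmetic. The sketch below assumes decode TPS is output tokens divided by (E2E minus TTFT) with all 256 requested tokens generated; that definition is an assumption on our part for illustration, and it reproduces the reported decode column to within about 1 tok/s given the rounding of the published timings.

```python
# (input_tokens, ttft_s, reported_decode_tps, e2e_s), copied from the table.
ROWS = [
    (477, 0.54, 69.7, 4.2),
    (1724, 0.68, 68.9, 4.4),
    (6669, 2.09, 66.0, 6.0),
    (26535, 9.37, 58.6, 13.7),
    (52980, 17.34, 51.8, 22.3),
    (79468, 23.06, 45.8, 28.6),
    (99291, 21.12, 41.9, 27.2),
]


def decode_tps(output_tokens: int, ttft_s: float, e2e_s: float) -> float:
    """Tokens per second over the decode phase only (assumed definition)."""
    return output_tokens / (e2e_s - ttft_s)


def prefill_tps(input_tokens: int, ttft_s: float) -> float:
    """Input tokens per second, treating TTFT as pure prefill time."""
    return input_tokens / ttft_s


if __name__ == "__main__":
    for n_in, ttft, reported, e2e in ROWS:
        print(f"{n_in:>6} in: decode {decode_tps(256, ttft, e2e):5.1f} tok/s "
              f"(reported {reported}), prefill {prefill_tps(n_in, ttft):7.0f} tok/s")
```

Treating TTFT as pure prefill overstates prefill cost for the short rows, where fixed request overhead dominates, so the prefill figure is only meaningful for the longer inputs.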

What this means for use-case selection:

Chat-length contexts (<5K): ~68-70 tok/s decode, ~0.5-2s TTFT. Comfortable for interactive chat or autocomplete.
Document review (8K-30K): ~58-66 tok/s decode, 2-10s TTFT. Workable for batch document analysis.
Long-context retrieval / agentic (50K-100K): ~42-52 tok/s decode, 17-23s TTFT. Useful for offline workloads; user-perceived latency at this scale will be dominated by the prefill, not the decode.

These numbers come from /tmp/tps-sweep-v2.py against the live cluster and are reproducible against the same image. We’re publishing the script in the repo alongside the bundle Dockerfile when the source repo re-opens.

What we learned that wasn’t obvious

PyTorch’s “private pools” OOM label includes the cumem allocator’s MemPool. Not just cuda graphs. If you’re chasing graph-capture configuration based on this label, you might be looking at the wrong thing.
current_platform.is_device_capability_family(120) is the right pattern for SM12x-specific config. If you’re adding consumer-Blackwell adaptations, gate them behind this so non-SM12x users see no change.
PR #41834’s perf wins come with memory costs that aren’t called out in the PR description. The 22 GiB non-cumem footprint isn’t documented. If you’re testing on H200 you’d never notice; on consumer Blackwell you’ll hit it immediately.
--enforce-eager as a DIAGNOSTIC is more useful than as a deployment posture. It’s the cleanest way to ask “is this OOM coming from graphs?” Even when you don’t want to deploy with it, run one experiment with it to disambiguate.
The cumem cycle leak is real but secondary. We chased it for weeks thinking it was the primary blocker. With PR #35489 + PR #34600 + the 0.60 config, we no longer have to restart to clear cumem accumulation under our normal rotation load.

Honest status note (2026-05-17 — bundle:v5 validation + the DeepGEMM gap)

We built bundle:v5 to unify both models on one image (cu129-nightly base + PR #35489 cumem error_code reset + PR #34600 wake_up partial-map rollback + the MiMo V-pad diffkv overlay baked in). Half worked, half exposed a separate gap. The honest writeup:

MiMo on bundle:v5: validated, cycle-stable. Three back-to-back wake/sleep cycles after the cold-load:

Cycle	Wake	Sleep
1	2.66s	2.99s
2	2.57s	2.92s
3	2.56s	2.94s

Steady-state is ~2.6s wake / ~2.95s sleep. The very first /sleep of a fresh cold-load was 49s: that’s CUDA-graph teardown + cumem private-pool initialization, paid exactly once per process lifetime, not per cycle. We mention the 49s explicitly because we initially reported it as a steady-state number and a reviewer caught it; the multi-cycle test above is the corrected reading. The cumem race that PR #35489 addresses, which reproduced on the stock cu129-nightly image, no longer reproduces on bundle:v5.

DSv4 on bundle:v5: failed at engine init. Worker __init__ raised RuntimeError: Sparse Attention Indexer CUDA op requires DeepGEMM to be installed from vllm/model_executor/layers/sparse_attn_indexer.py:442. Two things are going on here that took us a minute to untangle:

Bundle:v5’s overlay already patches vllm/v1/attention/backends/mla/indexer.py to skip the DeepGEMM get_paged_mqa_logits_metadata call on SM_120 (the SM_120 compute path ignores schedule metadata anyway). That handles the MLA-backend indexer.
There’s a second sparse-attention layer at vllm/model_executor/layers/sparse_attn_indexer.py which is the upstream layer-level SparseAttnIndexer DSv4’s Lightning Indexer instantiates. It does a hard import deep_gemm and raises if it’s not installed. That file didn’t exist on the older vllm-openai:deepseekv4-cu130 base the previous DSv4 image (we’ll call it bundle:v4) was built on, so it never came up.

DeepGEMM itself doesn’t ship SM_120 kernels. Inspect csrc/jit_kernels/impls/ on the main branch: you’ll find sm90_*, sm100_*, and smxx_* variants, no sm120_*. So “just install DeepGEMM” doesn’t actually fix it on consumer Blackwell — the JIT would compile, then the runtime kernels would assert. The real fix is what the existing indexer.py overlay does: patch sparse_attn_indexer.py to detect SM_120 and route to the Triton fallback path the jasl PR #41834 overlay already provides for the MLA backend.

That patch (we’ll call it bundle:v6) is straightforward and we’ll ship it. Meanwhile the production rotation pool is hybrid: DSv4 stays on bundle:v4 (the older deepseekv4-cu130 base where sparse_attn_indexer.py doesn’t exist), MiMo runs on bundle:v5. Cross-peer rotation still works cleanly because each model owns its own sleep/wake state; the cumem fix matters per-process, not per-pool.

One architectural observation worth pulling out: bundle:v5 is architecture-agnostic for everything except the sparse-MLA / Lightning Indexer family. The cumem cherry-picks + V-pad overlay + wake_up rollback are model-independent — any TP=4 dense, dense-MoE, or non-sparse-MLA model should rotate cleanly on bundle:v5. Only DSv4-class models drag in the DeepGEMM dependency, and only because of the Lightning Indexer’s sparse-attn layer. If your rotation pool doesn’t include a DSv4-family model, you can ignore the bundle:v5 → bundle:v6 work entirely and run two non-DSv4 models on bundle:v5 today.

What about other attention types

Sleep-mode behavior on consumer Blackwell is attention-type sensitive. What works for one architecture’s KV cache layout and routing table doesn’t necessarily work for another. What we’ve validated on this stack:

Model	Attention	Status
DeepSeek-V4-Flash	sparse-MLA (Triton fallback via PR #41834)	Full sleep/wake at TP=4. Live.
MiMo-V2.5-Flash	dense MLA	Full sleep/wake at TP=4. Live.
Q3-Coder-Next-80B-A3B	hybrid DeltaNet + SWA	`/sleep level=2` does not release cleanly on the cu129 image — VRAM stays held. We keep this one always-awake. Tracking vllm#41602.
Single-GPU 7B-class dense	standard MHA / GQA	Sleep/wake works trivially. Sub-second on llama-swap pools. This is the easy case.

If you’re trying sleep-mode on a model with an unusual attention variant (hybrid linear/SWA, RWKV-style, Mamba) on consumer Blackwell, budget extra time. The cumem allocator path is uniform; the per-architecture /sleep + /wake_up release paths are not.

What ships

The runtime image is on GHCR:

docker pull ghcr.io/doradusresearch/vllm-blackwell-sm12x-bundle:v4

It bundles the PR #41834 vLLM wheel + the two cumem cherry-picks (#35489 + #34600) + the SM12x-gated workspace shrinks, on top of the upstream vllm/vllm-openai:v0.20.2-cu129-ubuntu2404 base. Run it with the config table at the top of this post (TP=4, --enable-sleep-mode, --gpu-memory-utilization 0.60 for DSv4 / 0.85 for MiMo).

The source repository (Dockerfile + patches + Nomad example) is being scrubbed and re-released. Pull the image to reproduce the result today.

Apache-2.0. Feedback welcome — especially benchmarks on hardware we haven’t tested.

Acknowledgements

@jasl + @aabbccddwasd for PR #41834, which is the entire SM12x DSv4 enablement. @haosdent for PR #35489, the one-line fix that took us a week to find. The vLLM cumem allocator authors for the sleep/wake_up design that makes sub-5-second model swap possible at all. Our small workspace-shrink contribution back is at #42856.