Strix Halo at Full Context — Why Your Decode Drops 64% and What Actually Fixes It

I’ve been running a Minisforum MS-S1 Max (AMD Ryzen AI MAX+ 395, Radeon 8060S iGPU, 128GB LPDDR5X) as my homelab’s long-context inference tier for months. The headline from the first round: Qwen3.6-35B-A3B at ~25 tok/s with 153k tokens of live context, all under 100W.

A lot has changed since then. ROCm 7.13 finally got gfx1151 codegen working (7.2.2 could see the GPU but couldn’t compile shaders). MTP merged to llama.cpp main on May 16. I’ve run three models across two backends at three prompt lengths plus a dedicated full-context decode test.

These are all the numbers.

Hardware and Software

| Component | Spec |
|---|---|
| APU | AMD Ryzen AI MAX+ 395 (Radeon 8060S iGPU, gfx1151) |
| Memory | 128GB LPDDR5X unified |
| Practical weight limit | ~100GB (48GB reserved for system/KV/compute) |
| Practical max context | 262k (DeltaNet hybrid), 131k (dense) |
| ROCm backend | 7.13.0a20260515, therock-gfx1151 codegen path |
| Vulkan backend | Vulkan 1.3 RADV, unified heap enabled |
| llama.cpp | b9188 (604990613), May 16, 2026 |
| ROCm build | -DGGML_HIP=ON, CMAKE_PREFIX_PATH=therock-gfx1151 |
| Vulkan build | -DGGML_VULKAN=ON |
| Serving | llama-swap proxy, temperature 0.1 for all tests |

All tests used -c 262144, -np 1, -t 16, -tb 32, -ngl 999, -fa on, --no-mmap. Prompts were made unique per model to prevent KV cache hits between tests.

Models Tested

| Model | Architecture | Quant | Size | Active Params |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | MoE (3B active) | Q8_K_XL | 38.5GB | 3B |
| Qwen3.6-27B | Dense | Q8_0 | ~27GB | 27B |
| Qwen3.5-122B-A10B | MoE (10B active) | Q4_K_L | ~99GB | 10B |

Each tested with and without MTP (--spec-type draft-mtp --spec-draft-n-max 2) on both ROCm and Vulkan: 12 total configurations.
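For anyone reproducing the matrix, the flag set can be collected into a single launch command. This is only a sketch: the `llama-server` binary name and the model path are my assumptions for illustration (the runs here went through llama-swap), while the flags themselves are the ones listed in this post.

```python
import shlex

# Flags shared by every test in this post.
COMMON_FLAGS = ["-c", "262144", "-np", "1", "-t", "16", "-tb", "32",
                "-ngl", "999", "-fa", "on", "--no-mmap"]

# Extra flags used only for the MTP runs.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]


def build_command(model_path: str, mtp: bool, binary: str = "llama-server") -> list[str]:
    """Return the argv for one test configuration.

    `binary` and `model_path` are assumptions for illustration only.
    """
    cmd = [binary, "-m", model_path, *COMMON_FLAGS]
    if mtp:
        cmd += MTP_FLAGS
    return cmd


# Hypothetical model path, shown only to print a complete command line.
print(shlex.join(build_command("/models/qwen3.6-35b-a3b-q8.gguf", mtp=True)))
```

Toggling `mtp` per model and building once per backend gives the 12 configurations.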

Historical Results (for reference)

April 2026 — Vulkan RADV, 9-Model Shootout (llama.cpp b8637)

| Model | Quant | Decode tok/s | Prefill tok/s | Notes |
|---|---|---|---|---|
| Qwen3-Coder-Next 80B | Q8_0 | 36.2 | ~700 | Fastest MoE at the time |
| Gemma 4 26B-A4B | Q8_0 | 31.2 | ~600 | |
| GPT-oss 120B | MXFP4 | 38.3 | ~800 | Fastest, but zero needle retrieval |
| MiniMax M2.5 | Q3_K_S | 27.2 | ~500 | |
| Qwen3.5-122B | Q6_K_L | 18.2 | ~400 | Production baseline |
| Step-3.5-Flash | Q3_K_XL | 25.9 | ~450 | |
| Devstral-2 123B | Q5_K_M | 2.7 | ~100 | Too slow |
| dots1 142B | Q4_K_M | 3.1 | ~80 | Too slow |

April 2026 — Vulkan Qwen3.6-35B Q8 at Various Context Depths (b8762)

| Context | Prefill tok/s | Decode tok/s |
|---|---|---|
| 1k tokens | 1043.2 | 32.0 |
| 32k tokens | 703.1 | 30.0 |
| 64k tokens | 460.6 | 28.6 |

BF16 was tested at the same time: 9.8 tok/s decode (3x slower than Q8 due to bandwidth saturation). Abandoned.

May 2026 — ROCm 7.13 Initial Results (b9188)

First ROCm results on gfx1151 with therock-gfx1151 codegen. Empty context, Qwen3.6-35B Q8:

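| Prompt | Prefill tok/s | Decode tok/s | Power (W) | Temp (°C) |
|---|---|---|---|---|
| Simple math (17 tok) | 167.6 | 48.1 | 74-89 | 48 |
| Haiku (17 tok) | 165.8 | 45.9 | 74-89 | 48 |
| CPU vs GPU essay (26 tok) | 186.9 | 46.1 | 74-89 | 48 |
| Longer prompt (26 tok) | 192.0 | 46.1 | 74-89 | 48 |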

ROCm at 46 tok/s was 2.3x the Vulkan baseline of 20 tok/s. In the Vulkan depth test above, decode held up well across context depths (32.0 to 28.6 tok/s from 1k to 64k), which is where DeltaNet’s linear layers shine, while prefill degraded 56% over the same range (1043 to 461 tok/s).
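The depth numbers from the April Vulkan table can be checked with a small helper. This is just arithmetic on the figures already reported; the function name is mine.

```python
# April Vulkan depth results: context -> (prefill tok/s, decode tok/s).
VULKAN_DEPTH = {
    "1k": (1043.2, 32.0),
    "32k": (703.1, 30.0),
    "64k": (460.6, 28.6),
}


def pct_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent (negative means slower)."""
    return (end - start) / start * 100.0


prefill_change = pct_change(VULKAN_DEPTH["1k"][0], VULKAN_DEPTH["64k"][0])
decode_change = pct_change(VULKAN_DEPTH["1k"][1], VULKAN_DEPTH["64k"][1])

print(f"prefill 1k -> 64k: {prefill_change:.1f}%")
print(f"decode  1k -> 64k: {decode_change:.1f}%")
```

Prefill loses a bit over half its throughput from 1k to 64k, while decode gives up roughly a tenth.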

Round 1: Empty Context, Single Prompt

55-token prompt (TCP/UDP explanation), 200 output tokens.

Qwen3.6-35B-A3B (MoE, Q8_K_XL)

| Config | Prefill tok/s | Decode tok/s | MTP draft/accept |
|---|---|---|---|
| ROCm non-MTP | 237 | 46.0 | - |
| ROCm MTP | 205 | 58.3 | 152/122 (80%) |
| Vulkan non-MTP | 259 | 32.6 | - |
| Vulkan MTP | 252 | 45.6 | 140/128 (91%) |

Qwen3.6-27B (dense, Q8_0)

| Config | Prefill tok/s | Decode tok/s | MTP draft/accept |
|---|---|---|---|
| ROCm non-MTP | 104 | 7.7 | - |
| ROCm MTP | 85 | 13.2 | 145/126 (87%) |
| Vulkan non-MTP | 143 | 7.3 | - |
| Vulkan MTP | 94 | 10.7 | 144/127 (88%) |

Qwen3.5-122B-A10B (MoE, Q4_K_L)

| Config | Prefill tok/s | Decode tok/s | MTP draft/accept |
|---|---|---|---|
| ROCm non-MTP | 114 | 23.2 | - |
| ROCm MTP | 101 | 30.1 | 146/125 (86%) |
| Vulkan non-MTP | 93 | 26.7 | - |
| Vulkan MTP | 89 | 22.7 | 142/127 (89%) |

Key finding from this round: ROCm wins decode on 35B and 27B. 122B Vulkan non-MTP (26.7) edges ROCm (23.2) — likely ROCm HIP overhead scales worse with 10B active params. But MTP flips it back (ROCm 30.1 vs Vulkan 22.7). Dense 27B is unusable for interactive work at 7-13 tok/s regardless of backend.
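The MTP effect in this round is easier to compare as a speedup ratio next to the accept rate. The sketch below uses only the Round 1 numbers and reads the draft/accept column as drafted and accepted token counts, which is my interpretation of that column.

```python
# Round 1: (model, backend) -> (non-MTP decode, MTP decode, drafted, accepted)
ROUND1 = {
    ("35B", "ROCm"): (46.0, 58.3, 152, 122),
    ("35B", "Vulkan"): (32.6, 45.6, 140, 128),
    ("27B", "ROCm"): (7.7, 13.2, 145, 126),
    ("27B", "Vulkan"): (7.3, 10.7, 144, 127),
    ("122B", "ROCm"): (23.2, 30.1, 146, 125),
    ("122B", "Vulkan"): (26.7, 22.7, 142, 127),
}


def summarize(base_tg: float, mtp_tg: float, drafted: int, accepted: int) -> tuple[float, float]:
    """Return (accept rate, MTP decode speedup over non-MTP)."""
    return accepted / drafted, mtp_tg / base_tg


for (model, backend), row in ROUND1.items():
    accept_rate, speedup = summarize(*row)
    print(f"{model:>4} {backend:<6} accept {accept_rate:.0%}  MTP speedup {speedup:.2f}x")
```

122B on Vulkan is the only row where the ratio falls below 1.0, even though its accept rate is in line with the others.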

Round 2: Three Prompt Lengths

Three lengths: chat (~25 words, 100 output), coding (~87 words, 400 output), max context (~24,600 words / ~78k tokens, 30 output). Unique prompts per model.

Qwen3.6-35B-A3B (PP tok/s / TG tok/s)

| Config | Chat | Coding | Max Context |
|---|---|---|---|
| ROCm non-MTP | 257 / 46.2 | 418 / 45.8 | 453 / 35.9 |
| ROCm MTP | 226 / 63.7 | 368 / 51.1 | 490 / 41.7 |
| Vulkan non-MTP | 154 / 32.7 | 304 / 32.5 | 533 / 30.0 |
| Vulkan MTP | 153 / 46.8 | 268 / 37.8 | 465 / 38.8 |

Qwen3.6-27B (PP tok/s / TG tok/s)

| Config | Chat | Coding | Max Context |
|---|---|---|---|
| ROCm non-MTP | 119 / 7.7 | 162 / 7.6 | 157 / 6.4 |
| ROCm MTP | 101 / 14.2 | 133 / 11.5 | 154 / 10.1 |
| Vulkan non-MTP | 65 / 7.4 | 117 / 7.3 | 190 / 7.0 |
| Vulkan MTP | 55 / 11.3 | 103 / 9.8 | 158 / 10.7 |

Qwen3.5-122B-A10B (PP tok/s / TG tok/s)

| Config | Chat | Coding | Max Context |
|---|---|---|---|
| ROCm non-MTP | 98 / 23.3 | 176 / 23.0 | 266 / 14.3 |
| ROCm MTP | 82 / 31.0 | 166 / 26.1 | 215 / 21.4 |
| Vulkan non-MTP | 105 / 26.8 | 179 / 26.6 | 233 / 24.5 |
| Vulkan MTP | 60 / 23.2 | 108 / 24.5 | 185 / 23.7 |

MTP accept rates (chat/coding/max context):

  • 35B ROCm: 91%/65%/90%
  • 35B Vulkan: 94%/67%/90%
  • 27B ROCm: 96%/69%/90%
  • 27B Vulkan: 97%/68%/100%
  • 122B ROCm: 88%/68%/90%
  • 122B Vulkan: 93%/72%/82%

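Non-MTP decode barely moves between the chat and coding prompts (for example 46.2 vs 45.8 tok/s on 35B ROCm), so output length on its own (100 vs 400 tokens) has little effect. MTP decode does drop on coding, in line with the lower accept rates there (65-72%). Decode also falls at max context, most sharply on ROCm (35B: 46.2 to 35.9 tok/s; 122B: 23.3 to 14.3). Prefill throughput generally rises with prompt length, but at max context (78k tokens) prefill still dominates wall time: 27B on ROCm takes about 8.3 minutes just for prefill.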
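The prefill wait at max context follows directly from prompt size divided by prefill throughput. A quick sketch using the ROCm non-MTP max-context prefill rates from the tables above and the approximate 78k-token prompt size:

```python
PROMPT_TOKENS = 78_000  # approximate max-context prompt size in Round 2

# ROCm non-MTP prefill tok/s at max context, from the Round 2 tables.
MAX_CONTEXT_PREFILL = {"35B": 453, "27B": 157, "122B": 266}


def prefill_minutes(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Minutes spent in prefill before the first output token appears."""
    return prompt_tokens / prefill_tok_s / 60.0


for model, rate in MAX_CONTEXT_PREFILL.items():
    print(f"{model:>4}: {prefill_minutes(PROMPT_TOKENS, rate):.1f} min of prefill")
```

With only 30 output tokens in this test, decode adds a few seconds on top of those minutes.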

MTP on 122B Vulkan regresses decode at every prompt length: -13% on chat, -8% on coding, and -3% at max context versus non-MTP. The MTP overhead exceeds the compute benefit at this model size on Vulkan.
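The per-prompt MTP penalty on 122B Vulkan can be recomputed from the Round 2 decode figures. This is plain arithmetic on the table values; the names are mine.

```python
# Qwen3.5-122B-A10B on Vulkan, Round 2 decode tok/s: prompt -> (non-MTP, MTP)
VULKAN_122B_DECODE = {
    "chat": (26.8, 23.2),
    "coding": (26.6, 24.5),
    "max context": (24.5, 23.7),
}


def mtp_delta_pct(non_mtp: float, mtp: float) -> float:
    """Percent change in decode speed when MTP is switched on."""
    return (mtp - non_mtp) / non_mtp * 100.0


deltas = {prompt: mtp_delta_pct(*pair) for prompt, pair in VULKAN_122B_DECODE.items()}
for prompt, delta in deltas.items():
    print(f"{prompt:<12} {delta:+.1f}%")
```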

Round 3: Full Context Decode

The one that matters. ~76k token prompt (120 Python classes with random parameters), 5000 output tokens. Each model got a unique suffix so the KV cache never hit between models.

Qwen3.6-35B-A3B

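| Config | Decode tok/s | Empty Context Decode | Drop | Wall Time |
|---|---|---|---|---|
| ROCm non-MTP | 16.6 | 46.2 | -64% | 471s |
| ROCm MTP | 37.5 | 63.7 | -41% | 290s |
| Vulkan non-MTP | 28.9 | 32.7 | -12% | 317s |
| Vulkan MTP | 34.3 | 46.8 | -27% | 310s |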

Qwen3.6-27B

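| Config | Decode tok/s | Empty Context Decode | Drop | Wall Time |
|---|---|---|---|---|
| ROCm non-MTP | 6.2 | 7.7 | -19% | 1296s |
| ROCm MTP | 9.3 | 14.2 | -34% | 1032s |
| Vulkan non-MTP | 6.7 | 7.4 | -9% | 1143s |
| Vulkan MTP | 9.0 | 11.3 | -20% | 1034s |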

Qwen3.5-122B-A10B

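| Config | Decode tok/s | Empty Context Decode | Drop | Wall Time |
|---|---|---|---|---|
| ROCm non-MTP | 13.9 | 23.3 | -40% | 647s |
| ROCm MTP | 19.2 | 31.0 | -38% | 614s |
| Vulkan non-MTP | 23.7 | 26.8 | -12% | 536s |
| Vulkan MTP | 21.9 | 23.2 | -6% | 638s |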

All MTP configs achieved 76-78% acceptance on these 5000-token outputs (vs 88-97% on the short chat prompt in Round 2).
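The wall times can be sanity-checked with a two-term model: prefill time plus decode time. Round 3 prefill rates were not recorded separately, so this sketch assumes the Round 2 max-context prefill rates carry over to the 76k prompt. That is an assumption, not a measurement.

```python
PROMPT_TOKENS = 76_000  # approximate Round 3 prompt size
OUTPUT_TOKENS = 5_000

# Qwen3.6-35B-A3B: config -> (assumed prefill tok/s taken from Round 2 max context,
#                             Round 3 decode tok/s, reported wall time in seconds)
QWEN_35B = {
    "ROCm non-MTP": (453, 16.6, 471),
    "ROCm MTP": (490, 37.5, 290),
    "Vulkan non-MTP": (533, 28.9, 317),
    "Vulkan MTP": (465, 34.3, 310),
}


def estimate_wall_seconds(prefill_tok_s: float, decode_tok_s: float,
                          prompt_tokens: int = PROMPT_TOKENS,
                          output_tokens: int = OUTPUT_TOKENS) -> float:
    """Wall time estimate: time to ingest the prompt plus time to generate the output."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


for name, (pp, tg, reported) in QWEN_35B.items():
    estimate = estimate_wall_seconds(pp, tg)
    print(f"{name:<15} estimated {estimate:5.0f}s, reported {reported}s")
```

For the four 35B rows the estimate lands within a few seconds of the reported wall time, so under that assumption the decode rate and the wall-time column tell the same story.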

The Decode Drop Story

At empty context, ROCm was 1.4x faster than Vulkan for 35B non-MTP decode (46.2 vs 32.7 tok/s). At full context with 76k tokens in the KV cache, ROCm non-MTP drops to 16.6 tok/s, a 64% collapse. Vulkan only drops 12% (32.7 to 28.9).

My working explanation: ROCm’s HIP backend seems to pay higher overhead per KV access. With 76k tokens in the cache, each decode step requires attention over the full cache, and the ROCm path appears unable to keep up with the bandwidth demands. Vulkan’s shader path looks simpler and more memory-efficient for this workload.

MTP recovers most of the gap. ROCm MTP at 37.5 tok/s still beats Vulkan non-MTP at 28.9. But that lead has narrowed from about 1.9x at empty context (63.7 vs 32.7 tok/s) to 1.3x.
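The drop is easier to reason about as added latency per generated token. A small sketch using the 35B non-MTP decode rates from Round 3 (the helper name is mine):

```python
# Qwen3.6-35B-A3B non-MTP decode tok/s: backend -> (empty context, full context)
DECODE_35B = {"ROCm": (46.2, 16.6), "Vulkan": (32.7, 28.9)}


def added_ms_per_token(empty_tok_s: float, full_tok_s: float) -> float:
    """Extra milliseconds per generated token once the KV cache is full."""
    return 1000.0 / full_tok_s - 1000.0 / empty_tok_s


for backend, (empty, full) in DECODE_35B.items():
    extra = added_ms_per_token(empty, full)
    print(f"{backend:<6} {1000 / empty:.1f} ms -> {1000 / full:.1f} ms per token (+{extra:.1f} ms)")
```

ROCm adds roughly 39 ms per token with 76k tokens cached, against about 4 ms for Vulkan.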

What This Means for Real Work

If you’re running agentic sessions with long context and long outputs:

  • 35B MoE ROCm MTP (37.5 tok/s at full context) is the sweet spot for interactive chat
  • 122B Vulkan non-MTP (23.7 tok/s) is a viable quality lane if you don’t need MTP
  • 27B dense (6-9 tok/s) is not viable regardless of backend — the MoE architecture is the real speed multiplier
  • Vulkan MTP on 122B regresses (21.9 vs 23.7 tok/s at full context, about -8%), so skip it and use non-MTP instead

Power and Thermal

| Metric | ROCm 7.13 | Vulkan |
|---|---|---|
| Socket power | 74-89W | N/A |
| Temperature | 48°C | N/A |
| GFX activity | 78-99% | N/A |
| GTT used (35B loaded) | 43GB/114GB | N/A |

Only the ROCm runs were instrumented, so there is no Vulkan figure to compare utilization or idle overhead against (ROCm idles at 157MB; Vulkan is unknown). Power stayed in the 74-89W range across all ROCm workloads.
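Socket power and decode rate together give a rough energy cost per token. The sketch assumes the 74-89W socket range also holds during full-context decode, which was not logged per test.

```python
SOCKET_POWER_W = (74.0, 89.0)  # reported ROCm socket power range
FULL_CONTEXT_DECODE = 37.5     # tok/s, 35B ROCm MTP at full context


def joules_per_token(power_w: float, tok_s: float) -> float:
    """Energy per generated token: watts divided by tokens per second."""
    return power_w / tok_s


low = joules_per_token(SOCKET_POWER_W[0], FULL_CONTEXT_DECODE)
high = joules_per_token(SOCKET_POWER_W[1], FULL_CONTEXT_DECODE)
print(f"{low:.2f}-{high:.2f} J per token at {FULL_CONTEXT_DECODE} tok/s")
```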

What Changed

  • ROCm 7.2.2 → 7.13: the therock-gfx1151 codegen path finally works. 7.2.2 could enumerate the GPU but couldn’t compile shaders.
  • MTP merged to llama.cpp main: May 16. --spec-type draft-mtp --spec-draft-n-max 2 enables it. Accept rates in these tests ranged from 65% to 100%, depending on prompt type and output length.
  • Voyager node swap: voyager (Ryzen 9900X, RTX 5090) is now the foreground Hermes session; artemis (Strix Halo) is the parallel worker/auxiliary node.

Setup Notes

ROCm 7.13 + therock-gfx1151 requires the custom codegen path from kyuz0’s TheRock toolbox. Build llama.cpp with:

cmake -B build-rocm -DGGML_HIP=ON -DCMAKE_PREFIX_PATH=/home/therock-gfx1151
cmake --build build-rocm --config Release

BF16 models don’t work at full context on Strix Halo — the drivers can’t handle it. Q8 is the sweet spot for 35B, Q4 for 122B.

amd_iommu is off — it’s safe and gives a 2-6% speed boost on this hardware.

RADV unified heap is enabled via ~/.drirc. TTM pages_limit is set to 112GB.
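The TTM pages_limit setting is expressed in pages rather than bytes, so the 112GB figure has to be converted. A minimal sketch, assuming 4 KiB pages and reading 112GB as GiB (both are assumptions, so check your kernel's page size and how you count gigabytes):

```python
PAGE_SIZE_BYTES = 4096  # assumed 4 KiB pages


def gib_to_pages(gib: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Convert a size in GiB to a whole number of memory pages."""
    return gib * 1024**3 // page_size


pages = gib_to_pages(112)
print(f"pages_limit for 112 GiB: {pages}")
```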

Summary

ROCm is faster when the KV cache is small. Vulkan is more stable when it’s full. MTP recovers most of ROCm’s loss at full context but doesn’t help on 122B Vulkan. The MoE architecture (3B or 10B active) is the real differentiator: dense 27B is roughly 4.5x to 6x slower than 35B MoE in non-MTP decode at empty context (7.3-7.7 vs 32.6-46.0 tok/s) because it has 27B active params to process per token.

For my setup, ROCm 7.13 with MTP on the 35B MoE remains the production choice: 37.5 tok/s at full context, under 100W, with 262k context available.


Hardware: Minisforum MS-S1 Max, Ryzen AI MAX+ 395, 128GB LPDDR5X, Radeon 8060S iGPU (gfx1151). Software: ROCm 7.13.0a20260515, Vulkan 1.3 RADV, llama.cpp b9188. All numbers from live llama-swap proxy, not llama-bench synthetic runs. Raw data available on request.