Where Tensor-Parallel Inference Hits the NVLink Wall

May 31, 2026

2026-05-31 · GPU / distributed systems

Tensor parallelism (TP) splits each layer across GPUs, so every forward pass pays for
all-reduces over the interconnect. On a single node that interconnect is NVLink/NVSwitch, and
how close you get to its theoretical bandwidth decides whether TP helps or hurts. This post
measures it on 4× H100 and explains where the wall is.

Repo with the full harness and CSVs:
nccl-collectives-bench.

What was measured

The harness runs a bandwidth sweep (message sizes from 8 B to 8 GB) over the three
collectives that bound distributed LLM work: all-reduce, all-gather, and reduce-scatter. It
drives the canonical nvidia/nccl-tests and adds a parser and analysis layer on top. The
headline results:

- All-reduce bus bandwidth ≈ 366 GB/s, about 77 % of the per-GPU NVLink unidirectional
  bandwidth budget on this box. That 77 % is the practical ceiling TP communication runs
  into; the remaining gap is protocol overhead and the algorithm’s traffic multiplier.
- Algorithm ranking at large messages: NVLS > Ring > Tree. NVLink SHARP (NVLS) offloads
  the reduction into the switch, which is why it pulls ahead once messages are big enough
  to amortise setup.
- A protocol study (Simple / LL / LL128) showing the small-message latency floor, which is
  the regime that actually matters for decode, where each token’s all-reduce is tiny.
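To make the bus-bandwidth figure concrete, here is a minimal sketch of how nccl-tests relates its two reported bandwidths. Algorithm bandwidth is message size divided by elapsed time; bus bandwidth scales it by a per-collective factor, which for n ranks is 2(n-1)/n for all-reduce and (n-1)/n for all-gather and reduce-scatter. The helper names are illustrative and are not taken from the repo.

```python
# Bus-bandwidth correction factors documented by nccl-tests, for n ranks.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth(collective: str, size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in bytes/s from a measured message size and elapsed time."""
    algbw = size_bytes / time_s  # algorithm bandwidth: what the caller observes
    return algbw * BUSBW_FACTOR[collective](n_ranks)


def algorithm_bandwidth(collective: str, busbw: float, n_ranks: int) -> float:
    """Invert the correction: the algorithm bandwidth implied by a bus bandwidth."""
    return busbw / BUSBW_FACTOR[collective](n_ranks)


if __name__ == "__main__":
    # 366 GB/s is the all-reduce bus bandwidth quoted in this post, on 4 GPUs.
    algbw = algorithm_bandwidth("all_reduce", 366.0, 4)
    print(f"implied all-reduce algorithm bandwidth: {algbw:.1f} GB/s")
```

On 4 GPUs the all-reduce factor is 1.5, so the rate at which a single all-reduce call moves its payload is lower than the bus bandwidth by that factor.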

Why it matters for inference

Training all-reduces gradients on big tensors, so it lives in the bandwidth-bound regime
where 366 GB/s is good news. Decode is the opposite: one token at a time means small
messages, so you’re pinned against the latency floor, not the bandwidth ceiling. That is the
real “TP wall” — past a certain TP degree, the per-token all-reduce latency dominates and
adding GPUs makes decode slower, not faster.
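The shape of that wall can be sketched with a simple latency-plus-bandwidth model. Everything below is an assumption for illustration except the 366 GB/s bus bandwidth from this post: a ring-style latency term of 2(n-1) hops, a hypothetical 5 µs per hop, 80 layers with two all-reduces each (Megatron-style), 20 ms of single-GPU work per token that divides perfectly across GPUs, and a 16 KiB message (hidden size 8192 in 16-bit precision). Applying the 4-GPU bandwidth to other TP degrees is also an assumption. None of these values are measurements from the repo.

```python
BUSBW = 366e9          # bytes/s, all-reduce bus bandwidth quoted in this post
HOP_LATENCY_S = 5e-6   # hypothetical per-hop latency, not a measured value
N_LAYERS = 80          # hypothetical model depth
COMPUTE_S_TP1 = 20e-3  # hypothetical per-token time on one GPU
MSG_BYTES = 8192 * 2   # one token's hidden state in 16-bit precision


def allreduce_time_s(size_bytes: float, n_ranks: int) -> float:
    """Estimate: ring-style latency of 2(n-1) hops plus transfer at algorithm bandwidth."""
    if n_ranks < 2:
        return 0.0  # nothing to reduce on a single GPU
    latency = 2 * (n_ranks - 1) * HOP_LATENCY_S
    algbw = BUSBW * n_ranks / (2 * (n_ranks - 1))
    return latency + size_bytes / algbw


def decode_step_time_s(tp: int) -> float:
    """Per-token time: idealised compute split plus two all-reduces per layer."""
    compute = COMPUTE_S_TP1 / tp
    comm = 2 * N_LAYERS * allreduce_time_s(MSG_BYTES, tp)
    return compute + comm


if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        print(f"TP={tp}: {decode_step_time_s(tp) * 1e3:.2f} ms per token")
```

Under these assumptions the transfer term is tens of nanoseconds per all-reduce while the latency term is tens of microseconds, so the per-token time stops improving after TP=4 and gets worse at TP=8. The crossover point moves with the assumed hop latency and compute time, which is why it has to be measured rather than assumed.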

The repo also includes an eager-vs-CUDA-Graph comparison of that decode latency wall:
capturing the per-token step as a graph removes launch overhead that would otherwise be
indistinguishable from communication cost — a reminder to measure the right thing before
blaming the fabric.

Takeaway

“Use tensor parallelism” is not free advice. Measure the all-reduce on your fabric, know
your 77 %, and know that the number that decides decode latency is the small-message floor —
not the big-message bandwidth everyone quotes.

→ Methodology, raw CSVs, and the roofline analysis:
github.com/waynehacking8/nccl-collectives-bench
