Blackwell's GB202 is a roughly 750 mm² die with 92.2 billion transistors and 192 SMs, a record level of compute scale for a consumer GPU.
It packs 16 SMs into each GPC to scale SM count cheaply, but small dispatches may underutilize such a wide GPU.
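As a rough illustration of that underutilization concern, the hypothetical CUDA sketch below launches far fewer thread blocks than GB202 has SMs; the kernel, grid shape, and printed arithmetic are invented for illustration, not taken from the article.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallKernel(float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx * 2.0f;   // trivial per-thread work
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs available: %d\n", prop.multiProcessorCount);   // 192 on a full GB202

    const int blocks = 64, threads = 128;                      // deliberately small dispatch
    float *out;
    cudaMalloc((void **)&out, blocks * threads * sizeof(float));
    smallKernel<<<blocks, threads>>>(out);
    cudaDeviceSynchronize();

    // 64 blocks spread over 192 SMs: at least 128 SMs receive no work from this launch.
    printf("Blocks launched: %d -> at most %d SMs occupied\n", blocks, blocks);
    cudaFree(out);
    return 0;
}
```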
Blackwell eliminates the need for subchannel switches, allowing graphics and compute workloads to overlap on the same queue.
The SM frontend uses fixed-length 128-bit instructions, backed by a 32 KB L0 instruction cache and an approximately 128 KB L1 instruction cache per SM.
Each SM partition has a single 32-wide execution pipe that handles both FP32 and INT32 operations, and tracks up to 12 active wave (warp) slots.
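For context on how those wave slots translate into resident work, here is a minimal sketch using CUDA's occupancy API; the assumption of 4 partitions per SM and the trivial kernel are mine, not the article's.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumed layout: 12 wave slots per partition * 4 partitions = 48 warps = 1536 threads per SM.
    int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Max resident threads per SM: %d (%d warps)\n",
           prop.maxThreadsPerMultiProcessor, warpsPerSM);

    // The occupancy API reports how many blocks of a given shape can be resident per SM at once.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel, 256, 0);
    printf("Resident 256-thread blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```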
Blackwell doubles per-SM ray tracing intersection rates and includes opacity micromap support.
Each SM has a 128 KB unified L1/shared memory block with 128 B/cycle of bandwidth, giving roughly 24 MB of L1 capacity across the GPU.
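A crude way to probe that bandwidth is a timed shared-memory read loop like the sketch below; the buffer size, block count, and kernel are illustrative assumptions rather than the article's actual test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int ITERS = 4096;
constexpr int ELEMS = 4096;                             // 16 KB of the 128 KB L1/shared block

__global__ void sharedReadLoop(float *out) {
    __shared__ float buf[ELEMS];
    for (int i = threadIdx.x; i < ELEMS; i += blockDim.x)
        buf[i] = (float)i;
    __syncthreads();

    float acc = 0.0f;
    for (int it = 0; it < ITERS; ++it)
        for (int i = threadIdx.x; i < ELEMS; i += blockDim.x)
            acc += buf[i];                              // streaming reads from shared memory
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // keep the loop from being optimized out
}

int main() {
    const int blocks = 192, threads = 256;              // roughly one block per SM on GB202
    float *out;
    cudaMalloc((void **)&out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    sharedReadLoop<<<blocks, threads>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = (double)blocks * ITERS * ELEMS * sizeof(float);
    printf("Approx. aggregate shared-memory read bandwidth: %.0f GB/s\n", bytes / ms / 1e6);
    cudaFree(out);
    return 0;
}
```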
The L2 cache is split into 64 banks, delivering around 8.7 TB/s of bandwidth, and is paired with a 512-bit GDDR7 memory bus.
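The L2-versus-DRAM split can be illustrated with a streaming-read sketch like the one below: a working set that fits in L2 should approach the cache figure, while a much larger one falls back to the 512-bit GDDR7 bus. The sizes, grid shape, and kernel here are assumptions for illustration, not the article's methodology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void streamRead(const float4 *in, float *out, size_t n, int iters) {
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int it = 0; it < iters; ++it)
        for (size_t i = idx; i < n; i += stride) {
            float4 v = in[i];                     // 16 B per access, coalesced
            acc += v.x + v.y + v.z + v.w;
        }
    out[idx] = acc;                               // prevent dead-code elimination
}

static double measureGBs(size_t bytes, int iters) {
    size_t n = bytes / sizeof(float4);
    float4 *in; float *out;
    cudaMalloc((void **)&in, bytes);
    cudaMalloc((void **)&out, 1024 * 256 * sizeof(float));
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    streamRead<<<1024, 256>>>(in, out, n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(in); cudaFree(out);
    return (double)bytes * iters / ms / 1e6;      // GB/s
}

int main() {
    printf("L2-resident (32 MB):  %.0f GB/s\n", measureGBs(32ull << 20, 64));
    printf("DRAM-resident (4 GB): %.0f GB/s\n", measureGBs(4ull << 30, 4));
    return 0;
}
```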
In compute tests such as FluidX3D, Blackwell outperforms AMD's RDNA4-based RX 9070, thanks to its higher SM count and memory bandwidth.
Nvidia holds the top consumer GPU position with Blackwell, as AMD and Intel currently lack comparable high-end GPUs.