Today at Nvidia’s developer conference, GTC 2024, the company revealed its next GPU, the B200. The B200 is capable of delivering 4 times the training performance, up to 30 times the inference performance, and up to 25 times better energy efficiency, compared with its predecessor, the Hopper H100 GPU. Based on the new Blackwell architecture, the GPU can be combined with the company’s Grace CPUs to form a new generation of DGX SuperPOD computers capable of up to 11.5 billion billion floating-point operations (exaflops) of AI computing using a new, low-precision number format.
“Blackwell is a new class of AI superchip,” says Ian Buck, Nvidia’s vice president of high-performance computing and hyperscale. Nvidia named the GPU architecture for mathematician David Harold Blackwell, the first Black inductee into the U.S. National Academy of Sciences.
The B200 consists of about 1,600 square millimeters of processor on two silicon dies that are linked in the same package by a 10-terabyte-per-second connection, so they perform as if they were a single 208-billion-transistor chip. These slices of silicon are made using TSMC’s N4P chip technology, which offers a 6 percent performance boost over the N4 technology used to make Hopper-architecture GPUs, such as the H100.
Like Hopper chips, the B200 is surrounded by high-bandwidth memory, which is increasingly important to reducing the latency and energy consumption of large AI models. The B200’s memory is the latest variety, HBM3e, and it totals 192 GB (up from 141 GB for the second-generation Hopper chip, the H200). Additionally, the memory bandwidth is boosted to 8 terabytes per second from the H200’s 4.8 TB/s.
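To put those bandwidth numbers in perspective, here is a rough back-of-envelope sketch in Python. When inference is memory-bound, generating each token requires streaming the model’s weights from memory, so bandwidth caps the output rate; the model size used below is an assumed figure for illustration, not anything Nvidia has stated.

```python
# Back-of-envelope: why memory bandwidth bounds memory-bound inference.
# Each generated token requires streaming the model's weights from HBM,
# so tokens/s is at most bandwidth / weight footprint in bytes.

def tokens_per_second(bandwidth_tb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model."""
    return bandwidth_tb_s * 1e12 / (model_gb * 1e9)

MODEL_GB = 180  # assumed weight footprint, chosen to nearly fill 192 GB

print(f"H200 at 4.8 TB/s: ~{tokens_per_second(4.8, MODEL_GB):.0f} tokens/s")
print(f"B200 at 8.0 TB/s: ~{tokens_per_second(8.0, MODEL_GB):.0f} tokens/s")
```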
Smaller Numbers, Faster Chips
Chipmaking technology did some of the job in making Blackwell, but it’s what the GPU does with the transistors that really makes the difference. In explaining Nvidia’s AI success to computer scientists last year at IEEE Hot Chips, Nvidia chief scientist Bill Dally said that most of it came from using fewer and fewer bits to represent numbers in AI calculations. Blackwell continues that trend.
Its predecessor architecture, Hopper, was the first instance of what Nvidia calls the transformer engine. It’s a system that examines each layer of a neural network and determines whether it could be computed using lower-precision numbers. Specifically, Hopper can use floating-point number formats as small as 8 bits. Smaller numbers are faster and more energy efficient to compute, they require less memory and memory bandwidth, and the logic required to do the math takes up less silicon.
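As a concrete illustration of the principle, the sketch below stores a small tensor in 8 bits plus one shared scale factor. It uses integer rounding as a simplified stand-in, since Nvidia’s FP8 formats are floating point and are not available in plain NumPy; the point is only that 8-bit storage cuts memory by 4x versus 32-bit floats at a modest accuracy cost.

```python
import numpy as np

# Simplified stand-in for 8-bit AI math: one shared scale factor plus
# 8-bit integers in place of 32-bit floats. (Nvidia's FP8 formats are
# floating point; int8 is used here because NumPy provides it.)

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)

scale = np.abs(weights).max() / 127.0            # one scale for the tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
restored = q.astype(np.float32) * scale          # approximate original values

print("max error:", np.abs(weights - restored).max())
print("bytes: fp32 =", weights.nbytes, " int8 =", q.nbytes)
```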
“With Blackwell, we’ve taken a step further,” says Buck. The new architecture has units that do matrix math with floating-point numbers just 4 bits wide. What’s more, it can decide to deploy them on parts of each neural network layer, not just entire layers as Hopper does. “Getting down to that level of fine granularity is a miracle in itself,” says Buck.
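The sketch below shows what that finer granularity buys, again using an integer stand-in rather than Blackwell’s actual FP4 format: with one scale shared across a whole tensor, a single outlier flattens every other value into a handful of levels, while per-block scales confine the damage to one block.

```python
import numpy as np

# Illustration of fine-grained scaling: round values to 4-bit levels
# (-7..7), with one scale factor per `block` consecutive values. This is
# a toy stand-in for Blackwell's FP4, not Nvidia's actual scheme.

def quantize_4bit(x: np.ndarray, block: int) -> np.ndarray:
    """Quantize-dequantize x with one scale per block of values."""
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i:i + block]
        scale = float(np.abs(chunk).max()) or 1.0  # avoid divide-by-zero
        out[i:i + block] = np.round(chunk / scale * 7.0) / 7.0 * scale
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
w[10] = 20.0  # a single outlier value

for block in (1024, 32):  # whole-tensor scale vs. fine-grained blocks
    err = np.abs(w - quantize_4bit(w, block)).mean()
    print(f"block size {block:4d}: mean error {err:.4f}")
```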
NVLink and Other Features
Among the other architectural insights Nvidia revealed about Blackwell is that it incorporates a dedicated “engine” devoted to the GPU’s reliability, availability, and serviceability. According to Nvidia, it uses an AI-based system to run diagnostics and forecast reliability issues, with the aim of increasing uptime and helping massive AI systems run uninterrupted for weeks at a time, a period often needed to train large language models.
Nvidia also included systems to help keep AI models secure and to decompress data to speed database queries and data analytics.
Finally, Blackwell incorporates Nvidia’s fifth-generation computer interconnect technology, NVLink, which now delivers 1.8 terabytes per second bidirectionally between GPUs and allows for high-speed communication among up to 576 GPUs. Hopper’s version of NVLink could reach only half that bandwidth.
SuperPOD and Other Computers
NVLink’s bandwidth is key to building large-scale computers from Blackwell that are capable of crunching through trillion-parameter neural network models.
The base computing unit is called the DGX GB200. Each of these includes 36 GB200 superchips, modules that combine a Grace CPU and two Blackwell GPUs, all linked together with NVLink.
The Grace Blackwell superchip combines two Blackwell GPUs and a Grace CPU in the same module. Nvidia
Eight DGX GB200s can be further linked via NVLink to form a 576-GPU supercomputer called a DGX SuperPOD. Nvidia says such a computer can blast through 11.5 exaflops using 4-bit-precision calculations. Systems of tens of thousands of GPUs are possible using the company’s Quantum InfiniBand networking technology.
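Those figures fit together as straightforward arithmetic, sketched below; the per-GPU number at the end is derived from the system-level figure, not something Nvidia quotes directly.

```python
# Checking the SuperPOD math with the figures quoted above.

gpus_per_superchip = 2     # each GB200 pairs one Grace CPU with two GPUs
superchips_per_dgx = 36
dgx_per_superpod = 8

total_gpus = dgx_per_superpod * superchips_per_dgx * gpus_per_superchip
superpod_exaflops = 11.5   # at 4-bit precision, per Nvidia

print(f"GPUs per SuperPOD: {total_gpus}")  # 576, the NVLink limit
print(f"Implied 4-bit compute per GPU: "
      f"~{superpod_exaflops * 1000 / total_gpus:.0f} petaflops")
```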
The company says to expect SuperPODs and other Nvidia computers to become available later this year.
