
AMD Takes Aim at CUDA with Massive ROCm 7.0 Update

AMD has released ROCm 7.0, one of the most significant updates to its GPU software stack to date. The new version introduces enhanced frameworks and new algorithms, squarely aimed at building a competitive alternative to NVIDIA's dominant CUDA ecosystem.


According to AMD, the pace of AI innovation is accelerating like never before. With models scaling to hundreds of billions of parameters and inference demands soaring, enterprises are seeking scalable, high-efficiency solutions that balance cost and performance. That puts immense pressure on developers to meet these demands while maintaining flexibility, code portability, and future-proofing. ROCm 7.0 is designed to empower developers and businesses to "move faster, scale smarter, and deploy AI."

Key Features of ROCm 7.0 Include:

  • Support for the Instinct MI350 series GPUs, delivering breakthrough AI training and inference performance.

  • Seamless distributed inference across clusters with support for leading frameworks.

  • Enhanced code portability with HIP 7.0, simplifying development and migration across hardware ecosystems (see the short sketch after this list).

  • New enterprise-focused tools to streamline AI infrastructure management and deployment.

  • Support for large models using popular MXFP4 and FP8 formats through AMD's Quark quantization technology.
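The HIP portability claim is easiest to appreciate with a concrete snippet. Below is a minimal vector-add sketch in HIP C++ (illustrative, not taken from AMD's documentation): the same source compiles with hipcc for AMD GPUs and, through HIP's CUDA backend, with NVIDIA's toolchain, which is what makes migration between the two ecosystems largely mechanical.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Simple element-wise addition kernel. The indexing is identical to what a
// CUDA programmer would write; only the runtime API prefix differs.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da, *db, *dc;
    hipMalloc((void**)&da, bytes);
    hipMalloc((void**)&db, bytes);
    hipMalloc((void**)&dc, bytes);
    hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;
    hipLaunchKernelGGL(vec_add, dim3(grid), dim3(block), 0, 0, da, db, dc, n);
    hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);

    printf("hc[0] = %.1f\n", hc[0]);  // expected: 3.0
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```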

The Hardware Powering the Push: Instinct MI350

The Instinct MI350 series, built on the new CDNA 4 architecture, was first unveiled at AMD's "Advancing AI 2025" event this past June. The accelerators use an advanced chiplet and die-stacking design: Accelerator Complex Dies (XCDs) fabricated on a 3nm-class (N3P) process are stacked directly on top of I/O Dies (IODs) built on a 6nm (N6) process. This 3D hybrid-bonded arrangement delivers exceptional performance density and power efficiency, while the interconnects between the IODs and the integration of the HBM3e memory are handled in 2.5D using CoWoS-S packaging technology.


A single AMD Instinct MI350 series accelerator features the following (a short device-query sketch follows the list):

  • 8 XCDs (Accelerator Complex Dies): Each contains 32 Compute Units, for a total of 256 CUs and 1024 Matrix Cores.

  • L2 Cache: 2MB of L2 cache is configured per XCD.

  • I/O & Infinity Cache: The IOD is composed of two N6 dies, providing a 128-channel HBM3e memory interface and 256MB of AMD Infinity Cache.

  • HBM3e Memory: The accelerator is equipped with eight HBM3e stacks, each a 12-layer (12-Hi) die stack of 36GB, for 288GB in total. This configuration delivers a blistering 8 TB/s of memory bandwidth at an 8 Gbps per-pin data rate (8 stacks × 1,024-bit interfaces × 8 Gbps ≈ 8 TB/s).

  • Interconnects: On-package, the Infinity Fabric AP interconnect provides 5.5 TB/s of bandwidth. For external connections, the accelerator uses a 4th Generation Infinity Fabric bus with 1075 GB/s of bandwidth, supplemented by a 128 GB/s PCIe 5.0 interface.
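For developers curious how one of these parts presents itself to software, the HIP runtime exposes the basics through hipGetDeviceProperties. The sketch below is a minimal, generic query; the exact figures reported depend on the driver version and on how the GPU is partitioned.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, dev);

        // multiProcessorCount reports Compute Units on AMD GPUs;
        // totalGlobalMem is the HBM capacity visible to this device.
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute Units : %d\n", prop.multiProcessorCount);
        printf("  Memory        : %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  L2 cache      : %d KB\n", prop.l2CacheSize / 1024);
    }
    return 0;
}
```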
