AMD Instinct™ MI325X Accelerators

Revolutionizing AI Performance and Scalability

The AMD Instinct™ MI325X GPU accelerators set a new benchmark in AI computing, delivering unmatched performance and efficiency for training and inference workloads.


Introducing the AMD Instinct MI325X Accelerator and Platform

3rd Gen AMD CDNA™ Architecture

  • Built on advanced die stacking and chiplet technology to deliver superior scalability and efficiency.

  • Purpose-designed for demanding AI workloads, providing exceptional performance for AI inferencing, training, and data analytics.

Unparalleled Memory Capacity

  • Equipped with 256 GB of high-capacity HBM3E memory, enabling seamless handling of massive datasets and complex computations.

  • Platforms powered by eight AMD Instinct™ MI325X accelerators deliver a combined 2 TB of HBM3E memory with low latency (see the sizing sketch after this list).

  • Serves as a drop-in replacement for the AMD Instinct MI300X Platform, enabling seamless scaling of existing deployments.

  • Improves multitasking efficiency and supports large AI models and multiple virtual machines.
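
For a sense of scale, here is a rough sizing sketch in plain Python; the model size and precision are illustrative assumptions, not AMD figures:

    # Rough sizing arithmetic for an 8-GPU MI325X platform (2 TB HBM3E total).
    GPU_MEM_GB = 256                       # per-accelerator HBM3E capacity
    PLATFORM_GB = 8 * GPU_MEM_GB           # 2048 GB across the platform

    params_billion = 405                   # assumed model size (illustrative)
    bytes_per_param = 2                    # FP16/BF16 weights

    weights_gb = params_billion * bytes_per_param   # 810 GB
    print(f"Weights: {weights_gb} GB of {PLATFORM_GB} GB "
          f"({weights_gb / PLATFORM_GB:.0%} of platform memory)")
    # -> Weights: 810 GB of 2048 GB (40% of platform memory)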

High-Speed Memory Bandwidth

  • Features 6 TB/s of peak memory bandwidth for accelerated data transfer and reduced latency.

  • Enhances scalability and supports high-resolution, data-intensive applications.

  • Ideal for real-time AI inferencing and advanced data processing tasks.
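
Because autoregressive inference is typically memory-bandwidth bound, a back-of-the-envelope sketch shows the per-token latency floor implied by 6 TB/s; the model size here is again an assumption for illustration:

    # Lower bound on decode latency for a memory-bound model: each generated
    # token streams the resident weights from HBM at least once.
    weights_gb = 70 * 2            # assumed 70B parameters at FP16 = 140 GB
    bandwidth_gb_s = 6000          # 6 TB/s peak memory bandwidth

    floor_s = weights_gb / bandwidth_gb_s
    print(f"{floor_s * 1e3:.1f} ms/token floor, "
          f"~{1 / floor_s:.0f} tokens/s ceiling at peak bandwidth")
    # -> 23.3 ms/token floor, ~43 tokens/s ceiling at peak bandwidth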


Built on 3rd Gen AMD CDNA Architecture

The AMD Instinct MI325X leverages the CDNA 3 architecture, featuring:

  • Enhanced AMD Matrix Core technology for improved throughput and streamlined compute performance.

  • AMD Infinity Fabric™ technology for optimized I/O efficiency, enabling seamless scaling within and across accelerators.

  • PCIe® Gen 5 interface with 16 lanes for high-speed host connections.

  • Seven Infinity Fabric links per GPU, providing a fully connected peer-to-peer topology among the eight GPUs on a platform.

  • Pre-configured in the MI325X Platform with eight accelerators linked by the AMD Universal Base Board (UBB 2.0) featuring HGX host connectors.
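
On a deployed system, these properties can be sanity-checked from software. The sketch below assumes a ROCm build of PyTorch, which exposes Instinct accelerators through the familiar torch.cuda API:

    import torch

    # ROCm builds of PyTorch surface AMD GPUs under the "cuda" device type.
    assert torch.cuda.is_available(), "No ROCm/HIP devices visible"

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # On AMD hardware, multi_processor_count reports Compute Units.
        print(f"GPU {i}: {props.name}, "
              f"{props.total_memory / 2**30:.0f} GiB, "
              f"{props.multi_processor_count} CUs")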


Multi-Chip Architecture

The MI325X utilizes a multi-chip architecture for dense computing and high-bandwidth memory integration:

  • Eight accelerated compute dies (XCDs), each equipped with:

    • 38 Compute Units (CUs), 32 KB of L1 cache per CU, and 4 MB of shared L2 cache.

    • Support for multiple precisions for AI/ML and HPC tasks, including native hardware support for sparsity.

  • 256 MB of AMD Infinity Cache™ shared across all eight XCDs.

  • Advanced media decoding: supports HEVC/H.265, AVC/H.264, VP9, and AV1, plus a dedicated 8-core JPEG/MPEG CODEC.

  • 256 GB of HBM3E memory with 6 TB/s peak throughput for data-intensive applications.

  • SR-IOV support for up to 8 partitions, enabling virtualization and multi-user environments.
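
To illustrate the multi-precision support in practice, here is a minimal mixed-precision sketch (standard PyTorch autocast on a ROCm build; nothing here is MI325X-specific API) that runs a matrix multiply through the BF16 matrix pipelines:

    import torch

    device = "cuda"  # ROCm maps Instinct GPUs to the cuda device type
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)

    # autocast dispatches eligible ops (like matmul) to BF16 Matrix Cores
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        c = a @ b

    print(c.dtype)  # torch.bfloat16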


Coherent Shared Memory for Large Models

The MI325X accelerators facilitate large-scale AI and HPC workloads with:

  • Coherent shared memory between eight accelerators on a UBB.

  • 128 GB/s of peak bidirectional bandwidth per GPU-to-GPU link for rapid data exchange (exercised in the sketch after this list).

  • Enhanced performance for memory-intensive AI, ML, and HPC models.
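
A simple way to exercise those GPU-to-GPU links is a timed peer copy. This sketch assumes at least two visible devices and a ROCm build of PyTorch; measured throughput will sit below the 128 GB/s per-link peak:

    import time
    import torch

    assert torch.cuda.device_count() >= 2, "Needs at least two GPUs"
    print("P2P capable:", torch.cuda.can_device_access_peer(0, 1))

    x = torch.empty(8 * 2**30, dtype=torch.uint8, device="cuda:0")  # 8 GiB
    x.to("cuda:1")                          # warm-up copy (allocates on peer)

    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    t0 = time.perf_counter()
    y = x.to("cuda:1")                      # device-to-device copy
    torch.cuda.synchronize("cuda:1")
    elapsed = time.perf_counter() - t0

    print(f"{8 / elapsed:.1f} GiB/s achieved over the fabric")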


Transforming AI and HPC Capabilities

The AMD Instinct MI325X accelerators redefine AI performance, enabling developers and enterprises to handle the most complex AI workloads with superior efficiency, scalability, and data processing speed. Whether it’s AI training, inference, or data analytics, the MI325X platform delivers the power and flexibility needed to drive cutting-edge AI solutions.

Contact us for pricing


Tech Specs 

Product Basics

Name

AMD Instinct™ MI325X

Family

Instinct

Series

Instinct MI300 Series

Form Factor

Servers

Launch Date

10/10/2024

GPU Specifications

GPU Architecture

AMD CDNA™ 3

Lithography

TSMC 5nm | 6nm FinFET

Stream Processors

19,456

Matrix Cores

1,216

Compute Units

304

Peak Engine Clock

2100 MHz

Peak Eight-bit Precision (FP8) Performance (E5M2, E4M3)

2.61 PFLOPs

Peak Eight-bit Precision (FP8) Performance with Structured Sparsity (E5M2, E4M3)

5.22 PFLOPs

Peak Half Precision (FP16) Performance

1.3 PFLOPs

Peak Half Precision (FP16) Performance with Structured Sparsity

2.61 PFLOPs

Peak Single Precision (TF32 Matrix) Performance

653.7 TFLOPs

Peak Single Precision (TF32) Performance with Structured Sparsity

1.3 PFLOPs

Peak Single Precision Matrix (FP32) Performance

163.4 TFLOPs

Peak Double Precision Matrix (FP64) Performance

163.4 TFLOPs

Peak Single Precision (FP32) Performance

163.4 TFLOPs

Peak Double Precision (FP64) Performance

81.7 TFLOPs

Peak INT8 Performance

2.61 POPs

Peak INT8 Performance with Structured Sparsity

5.22 POPs

Peak bfloat16 (BF16) Performance

1.3 PFLOPs

Peak bfloat16 (BF16) Performance with Structured Sparsity

2.61 PFLOPs

Transistor Count

153 Billion
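
As a cross-check, the peak-throughput entries above follow from the compute unit count and peak engine clock; the per-CU ops/clock values in this sketch are inferred from the published totals rather than taken from AMD documentation:

    # Peak throughput = CUs x engine clock x ops per CU per clock.
    # Ops/clock figures are back-solved from the published peaks (assumption).
    cus, clock_ghz = 304, 2.1
    for label, ops in [("FP16/BF16", 2048), ("FP8/INT8", 4096),
                       ("FP32 vector", 256), ("FP64 vector", 128)]:
        print(f"{label}: {cus * clock_ghz * ops / 1000:.1f} TFLOPS")
    # -> 1307.4, 2614.9, 163.4, and 81.7 TFLOPS respectively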

Requirements

External Power Connectors

54V UBB

Typical Board Power (TBP)

1000W Peak

GPU Memory

Last Level Cache (LLC)

256 MB

Dedicated Memory Size

256 GB

Dedicated Memory Type

HBM3E

Infinity Cache

Yes

Memory Interface

8192-bit

Memory Clock

6 GHz

Peak Memory Bandwidth

6 TB/s

Memory ECC Support

Yes (Full-Chip)
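
The bandwidth entry is consistent with the interface width and data rate; this quick check reads the 6 GHz memory clock as an effective 6 Gbps per pin (an assumption about how the entry is quoted):

    # Peak bandwidth = interface width x per-pin data rate / 8 bits per byte
    print(8192 * 6 / 8)   # 6144.0 GB/s, quoted as 6 TB/s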

Board Specifications

GPU Form Factor

OAM Module

Bus Type

PCIe® 5.0 x16

Infinity Fabric™ Links

8

Peak Infinity Fabric™ Link Bandwidth

128 GB/s

Cooling

Passive OAM

Additional Features

Supported Technologies

AMD CDNA™ 3 Architecture, AMD ROCm™ (Ecosystem without Borders), AMD Infinity Architecture

RAS Support

Yes

Page Retirement

Yes

Page Avoidance

Yes

SR-IOV

Yes


Footnotes

  1. MI325-002 - Calculations conducted by AMD Performance Labs as of May 28th, 2024 for the AMD Instinct™ MI325X GPU resulted in 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16), 2614.9 TFLOPS peak theoretical 8-bit precision (FP8), and 2614.9 TOPs peak theoretical INT8 performance. Actual performance will vary based on final specifications and system configuration.
    Published results on the Nvidia H200 SXM (141GB) GPU: 989.4 TFLOPS peak theoretical half precision tensor (FP16 Tensor), 989.4 TFLOPS peak theoretical Bfloat16 tensor format precision (BF16 Tensor), 1,978.9 TFLOPS peak theoretical 8-bit precision (FP8), and 1,978.9 TOPs peak theoretical INT8 performance. BFLOAT16 Tensor Core, FP16 Tensor Core, FP8 Tensor Core, and INT8 Tensor Core performance were published by Nvidia using sparsity; for the purposes of comparison, AMD converted these numbers to non-sparsity/dense by dividing by 2, and those numbers appear above.
    Nvidia H200 sources: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 and https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-accelerator-with-hbm3e-and-jupiter-supercomputer-for-2024
    Note: Nvidia H200 GPUs have the same published FLOPs performance as H100 products (https://resources.nvidia.com/en-us-tensor-core/).

  2. MI325-008 - Calculations conducted by AMD Performance Labs as of October 2nd, 2024 for the AMD Instinct™ MI325X (1000W) GPU, designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at a 2,100 MHz peak boost engine clock, resulted in 163.4 TFLOPs peak theoretical double precision Matrix (FP64 Matrix), 81.7 TFLOPs peak theoretical double precision (FP64), 163.4 TFLOPs peak theoretical single precision Matrix (FP32 Matrix), 163.4 TFLOPs peak theoretical single precision (FP32), 653.7 TFLOPS peak theoretical TensorFloat-32 (TF32), and 1307.4 TFLOPS peak theoretical half precision (FP16). Actual performance may vary based on final specifications and system configuration.
    Published results on the Nvidia H200 SXM (141GB) GPU: 66.9 TFLOPs peak theoretical double precision tensor (FP64 Tensor), 33.5 TFLOPs peak theoretical double precision (FP64), 66.9 TFLOPs peak theoretical single precision (FP32), 494.7 TFLOPs peak TensorFloat-32 (TF32), and 989.5 TFLOPS peak theoretical half precision tensor (FP16 Tensor). TF32 Tensor Core performance was published by Nvidia using sparsity; for the purposes of comparison, AMD converted this number to non-sparsity/dense by dividing by 2, and that number appears above.
    Nvidia H200 sources: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 and https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-accelerator-with-hbm3e-and-jupiter-supercomputer-for-2024
    Note: Nvidia H200 GPUs have the same published FLOPs performance as H100 products (https://resources.nvidia.com/en-us-tensor-core/). Nvidia H200 GPUs do not support FP32 Tensor.

  3. MI325-001A - Calculations conducted by AMD Performance Labs as of September 26th, 2024, based on current specifications and/or estimation. The AMD Instinct™ MI325X OAM accelerator will have 256GB HBM3E memory capacity and 6 TB/s GPU peak theoretical memory bandwidth performance. Actual results based on production silicon may vary.
    The highest published results on the Nvidia Hopper H200 (141GB) SXM GPU accelerator are 141GB HBM3E memory capacity and 4.8 TB/s GPU memory bandwidth performance. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446
    The highest published results on the Nvidia Blackwell HGX B100 (192GB) 700W GPU accelerator are 192GB HBM3E memory capacity and 8 TB/s GPU memory bandwidth performance.
    The highest published results on the Nvidia Blackwell HGX B200 (192GB) GPU accelerator are 192GB HBM3E memory capacity and 8 TB/s GPU memory bandwidth performance.
    Nvidia Blackwell specifications at https://resources.nvidia.com/en-us-blackwell-architecture