Summary:
1. Zyphra, AMD, and IBM collaborated to test AMD’s GPUs for large-scale AI model training, resulting in the creation of ZAYA1.
2. ZAYA1 is a Mixture-of-Experts model built entirely on AMD GPUs and networking, offering a viable alternative to NVIDIA for scaling AI.
3. The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software on IBM Cloud infrastructure, showcasing competitive performance and cost-effectiveness.
Article:
Zyphra, working with AMD and IBM, spent roughly a year evaluating whether AMD's GPUs and platform could support large-scale AI model training. The result is ZAYA1, a Mixture-of-Experts foundation model that challenges the industry's reliance on NVIDIA for scaling AI operations.
ZAYA1 was trained on AMD's Instinct MI300X chips, Pensando networking, and ROCm software, all deployed on IBM Cloud infrastructure. Notably, Zyphra's setup was conventional, resembling a typical enterprise cluster, just without any NVIDIA components. That makes the run a meaningful proof point: businesses now have a viable second option for expanding AI capacity without compromising on performance.
ZAYA1's performance is reported to be on par with, and in some areas ahead of, established open models in reasoning, mathematics, and coding. Its architecture combines compressed attention, a refined routing scheme, and residual scaling, allowing it to compete with larger peers such as Qwen3-4B and Gemma3-12B. Because it is a Mixture-of-Experts model, only a subset of parameters is active for each token, which keeps inference memory use manageable and reduces serving costs.
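To make the Mixture-of-Experts efficiency argument concrete, here is a minimal sketch of top-k expert routing. The dimensions, the single-matrix "experts", and the router are toy placeholders, not ZAYA1's actual architecture; the point is only that compute per token scales with k, the number of experts selected, rather than with the total expert count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, not ZAYA1's real sizes.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is reduced to a single weight matrix for brevity.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                   # one routing score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts only
    # Only the selected experts run, so per-token compute scales with top_k.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

A production MoE applies this per token across a batch and adds load-balancing losses, but the routing-then-mix pattern is the same.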
The main engineering challenge was adapting mature NVIDIA-based workflows to ROCm. Zyphra tuned model dimensions, GEMM shapes, and microbatch sizes to land in the MI300X's preferred compute ranges, and reworked storage to keep training runs efficient and operations streamlined.
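Tuning dimensions to a GPU's preferred compute ranges often comes down to keeping matrix sizes divisible by the hardware's tile width so GEMM kernels run with full tiles. A minimal sketch, with a hypothetical alignment value (the actual MI300X-optimal multiples are not given in the article):

```python
def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple, e.g. so GEMM tiles divide evenly."""
    return ((n + multiple - 1) // multiple) * multiple

# Hypothetical example: pad a hidden size to a multiple of 256.
hidden = 5000
aligned = round_up(hidden, 256)
print(aligned)  # 5120
```

The same idea applies to microbatch sizes: picking values that keep every kernel launch in the accelerator's efficient range avoids leaving compute idle.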
Keeping training clusters healthy over extended runs posed its own challenges, which Zyphra mitigated through its Aegis service. By monitoring logs and system metrics, the team quickly identified and fixed failures, improving job uptime and reducing operational burden. Distributed checkpointing further sped up saves, keeping the training rhythm uninterrupted.
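The speedup from distributed checkpointing comes from each rank writing only its own shard in parallel, rather than one node serializing the whole model. A minimal sketch of the pattern, using JSON files and atomic renames; the function names and on-disk layout are illustrative, not Zyphra's implementation:

```python
import json
import os
import tempfile

def save_shard(state: dict, rank: int, ckpt_dir: str) -> str:
    """Each rank writes only its own shard, so saves proceed in parallel."""
    path = os.path.join(ckpt_dir, f"shard_{rank}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a partial file
    return path

def load_all(ckpt_dir: str, world_size: int) -> dict:
    """Merge all shards back into one state dict on restore."""
    merged = {}
    for rank in range(world_size):
        with open(os.path.join(ckpt_dir, f"shard_{rank}.json")) as f:
            merged.update(json.load(f))
    return merged

ckpt_dir = tempfile.mkdtemp()
for rank, shard in enumerate([{"w0": [1, 2]}, {"w1": [3, 4]}]):
    save_shard(shard, rank, ckpt_dir)
print(load_all(ckpt_dir, world_size=2))  # {'w0': [1, 2], 'w1': [3, 4]}
```

The write-to-temp-then-rename step is what makes frequent checkpointing safe: a monitoring service like Aegis can restart a failed job from the last complete save without checking shards for corruption.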
The ZAYA1 training milestone shows that AMD's ecosystem is mature enough for large-scale model development and offers a compelling alternative to NVIDIA. Migrating entirely off NVIDIA clusters may not be practical for most organisations, but using AMD for specific training stages can add memory capacity and training throughput without major disruption. Organisations stand to benefit from a flexible approach to AI procurement, drawing on multiple vendors to optimise performance and scalability in their AI operations.