An Evaluation of the Fujitsu A64FX for HPC Applications

Poenaru, Andrei and Deakin, Tom and McIntosh-Smith, Simon and Hammond, Simon D. and Younge, Andrew J.

Cray User Group (CUG), 2021

Abstract

Recent generations of supercomputers have adopted different strategies in their attempts to remain competitive in the race to Exascale. In most cases, they rely on accelerators such as GPUs to deliver high arithmetic performance and memory bandwidth. But accelerators come with their own challenges due to their programming models, which can be hard for applications to exploit. The current leader in the TOP500 list, the Fugaku system in Japan, has chosen a different route: instead of offloading to accelerators, this system relies on a new generation of general-purpose CPUs to deliver GPU-class performance while maintaining the ease of use of a traditional CPU. This is the Fujitsu A64FX, a design purpose-built for high-performance computing (HPC) based on the Arm AArch64 architecture. It is able to deliver up to 1 TB/s of memory bandwidth by using the same HBM2 technology found in top-end GPUs, and it offers 512-bit-wide vectors through the Scalable Vector Extension (SVE). It is the first CPU to integrate either HBM2 or SVE. In this paper we evaluated the performance of the A64FX processor on a range of common scientific workloads. We used compute-bound and memory-bandwidth-bound mini-apps, and widely utilised full-scale scientific applications. These benchmarks have been successfully used in the past to quantify performance characteristics in other emerging HPC processors, such as the Arm-based Marvell ThunderX2 and the many-core Intel Xeon Phi. As part of this evaluation, we looked not only at raw application performance, but also at the maturity of the tools available for the A64FX. We uniquely compared all four major HPC compilers that can target the A64FX, including Cray, GNU, Arm and Fujitsu’s own compiler. We found the A64FX to be a strong competitor to mainstream HPC processors. In memory-bandwidth-bound benchmarks, it exceeded 800 GB/s and delivered more than twice the performance of a top-end Xeon or ThunderX2 dual-socket node. We observed particularly good vectorisation performance from the Fujitsu compiler, which was also able to further tune the code for this microarchitecture through techniques such as software pipelining.

@inproceedings{cug21,
  author = {Poenaru, Andrei and Deakin, Tom and McIntosh-Smith, Simon and Hammond, Simon D. and Younge, Andrew J.},
  title = {{An Evaluation of the Fujitsu A64FX for HPC Applications}},
  booktitle = {{Cray User Group (CUG)}},
  year = {2021},
  pdf = {https://cug.org/proceedings/cug2021_proceedings/includes/files/pap122s2-file1.pdf},
  keywords = {Conferences and Workshops}
}