Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia
Twelfth Workshop on Accelerator Programming and Directives (WACCPD), 2025
Abstract
Efficient graph processing is essential for a wide range of applications. Scalability and memory access patterns are still a challenge, especially with the Breadth-First Search algorithm. This work focuses on leveraging multi-GPU HPC nodes with peer-to-peer support of the Intel oneAPI implementation of SYCL. We propose three GPU-based load-balancing methods: work-group localisation for efficient data access, even workload distribution for higher GPU occupancy, and a hybrid strided-access approach for heuristic balancing. These methods ensure performance, portability, and productivity with a unified codebase. Our proposed methodologies outperform state-of-the-art single-GPU implementations based on CUDA on synthetic RMAT graphs. We analysed BFS performance across NVIDIA A100, Intel Max 1550, and AMD MI300X GPUs, achieving a peak performance of 153.27 GTEPS on an RMAT25-64 graph using 8 GPUs on the NVIDIA A100. Furthermore, our work handles RMAT graphs up to scale 29, achieving superior performance on synthetic graphs and competitive results on real-world datasets.
@inproceedings{waccpd25,
author = {Olgu, Kaan and Kenter, Tobias and Nunez-Yanez, Jose and McIntosh-Smith, Simon and Deakin, Tom},
title = {{Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel \& Nvidia}},
year = {2025},
booktitle = {{Twelfth Workshop on Accelerator Programming and Directives (WACCPD)}},
keywords = {Conferences and Workshops}
}