GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth on many-core processors across diverse parallel programming models

Deakin, Tom and Price, James and Martineau, Matt and McIntosh-Smith, Simon

Performance Portable Programming Models for Accelerators Workshop held in conjunction with International Supercomputing Conference (P3MA), 2016

Abstract

Many scientific codes consist of memory bandwidth bound kernels: the dominant factor in the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance. The choice of one programming model over another should ideally not limit the performance that can be achieved on a device. GPU-STREAM has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such, this tool can be used as a kind of Rosetta Stone, providing an array of achievable memory bandwidth results across both platforms and programming models.
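
To make "achievable memory bandwidth" concrete, the following is a minimal sketch of how a STREAM-style benchmark typically turns a kernel timing into a bandwidth figure. It is not taken from the GPU-STREAM source. The kernel names and the bytes-per-element convention follow the classic STREAM benchmark and are assumptions here, as are all function names (bytes_moved, bandwidth_gb_per_s, triad, best_time). The loop is pure Python, so the printed figure reflects interpreter overhead rather than hardware bandwidth; only the arithmetic is the point.

```python
import time
from array import array

# Assumed STREAM convention: number of arrays read or written per element.
ARRAYS_TOUCHED = {"copy": 2, "mul": 2, "add": 3, "triad": 3}


def bytes_moved(kernel, n, elem_size=8):
    """Bytes transferred by one pass of `kernel` over arrays of n elements."""
    return ARRAYS_TOUCHED[kernel] * n * elem_size


def bandwidth_gb_per_s(kernel, n, seconds, elem_size=8):
    """Sustained bandwidth in GB/s (1 GB = 1e9 bytes) for one timed pass."""
    return bytes_moved(kernel, n, elem_size) / seconds / 1e9


def triad(a, b, c, scalar):
    """Triad kernel: a[i] = b[i] + scalar * c[i] (two reads, one write)."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def best_time(fn, repeats=5):
    """Run fn several times and keep the fastest wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 1_000_000  # hypothetical array length, chosen only for illustration
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    seconds = best_time(lambda: triad(a, b, c, 0.5))
    print(f"triad: {bandwidth_gb_per_s('triad', n, seconds):.3f} GB/s")
```

Taking the fastest of several repeats is a common way to approximate a practical upper bound, since slower runs usually reflect interference rather than the device's limits.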

@inproceedings{p3ma16,
  author = {Deakin, Tom and Price, James and Martineau, Matt and McIntosh-Smith, Simon},
  title = {{GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth on many-core processors across diverse parallel programming models}},
  booktitle = {{Performance Portable Programming Models for Accelerators Workshop held in conjunction with International Supercomputing Conference (P3MA)}},
  year = {2016},
  publisher = {{Springer, Cham}},
  doi = {10.1007/978-3-319-46079-6_34},
  keywords = {Conferences and Workshops}
}