Dr Tom Deakin is a Senior Research Associate in the High Performance Computing Research Group at the University of Bristol, led by Professor Simon McIntosh-Smith. Tom is the Chair of the Khronos SYCL Advisory Panel and a member of the SYCL Working Group. He lectured Introduction to Computer Architecture at the University of Bristol in Autumn 2020. Tom completed his PhD, Leveraging Many-Core Technology for Deterministic Neutral Particle Transport at Extreme Scale, in 2018, and has since continued working on unstructured mesh transport and performance portability on HPC architectures. In 2012, Tom graduated top of his class from the University of Bristol with a first class honours MSci in Mathematics and Computer Science, winning the prize for best graduating student in Computer Science and Mathematics. Tom has been involved in teaching Introduction to High Performance Computing at the University of Bristol, and has given tutorials on parallel programming models at international conferences as well as private courses on OpenCL and OpenMP. Training material on OpenMP for Computational Scientists and HandsOnOpenCL can be found online.
PhD in High Performance Computing, 2018
University of Bristol
Postgraduate Artist Diploma in Trumpet Performance, 2014
Trinity Laban Conservatoire of Music and Dance
MSci in Mathematics and Computer Science, 2012
University of Bristol
Recent work has introduced a number of tools and techniques for reasoning about the interplay between application performance and portability, or “performance portability”. These tools have proven useful for setting goals and guiding high-level discussions, but our understanding of the performance portability problem remains incomplete. Different views of the same performance efficiency data offer different insights into an application’s performance portability (or lack thereof): standard statistical measures such as the mean and standard deviation require careful interpretation, and even metrics designed specifically to measure performance portability may obscure differences between applications. This paper offers a critical assessment of existing approaches for summarizing performance efficiency data across different platforms, and proposes visualization as a means to extract useful information about the underlying distribution. We explore a number of alternative visualizations, outlining a new methodology that enables developers to reason about the performance portability of their applications and how it might be improved. This study unpicks what it might mean to be “performance portable” and provides useful tools to explore that question.
In recent years the processors underpinning the large, distributed, workhorse computers used to solve the Boltzmann transport equation have become ever more parallel and diverse. Traditional CPU architectures have increased in core count, reduced in clock speed, and gained a deep memory hierarchy. Multiple processor vendors offer a collectively diverse range of both CPUs and GPUs, and the many-core architectures used in the fastest machines in the world are ever growing in diversity. Going forward, this landscape of processor technologies will require our codes to function well across multiple architectures. This ever increasing range of architectures presents a particular challenge for solving the Boltzmann equation using deterministic methods, and so it is important to characterize the performance of the key algorithms across the processor spectrum. The solution of the transport equation is computationally expensive, and so we require well optimized and highly parallel solver implementations in order to solve interesting problems quickly. In this work we explore the performance profiles of deterministic SN transport sweeps on both 3D structured (Cartesian) and unstructured (hexahedral) meshes. The study focuses on the computational characteristics which determine the actual performance of a transport solver.
Many scientific codes consist of memory bandwidth bound kernels. One major advantage of many-core devices such as general purpose graphics processing units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. Peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance. We augment the standard STREAM kernels with a dot product kernel to investigate the performance of simple reduction operations on large arrays. The choice of programming model should ideally not limit the achievable performance on a device. BabelStream (formerly GPU-STREAM) has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such, this tool can be used as a kind of Rosetta Stone, providing a cross-platform and cross-programming-model array of achievable memory bandwidth results.