Dr Tom Deakin is a Senior Research Associate in the High Performance Computing Research Group at the University of Bristol, led by Professor Simon McIntosh-Smith. Tom’s work includes researching the performance portability of massively parallel high performance simulation codes. Tom is the Chair of the Khronos SYCL Advisory Panel and a member of the SYCL Working Group. Tom completed his PhD in Leveraging Many-Core Technology for Deterministic Neutral Particle Transport at Extreme Scale in 2018, and has since continued working on unstructured mesh transport and performance portability on HPC architectures. In 2012, Tom graduated top of his class from the University of Bristol with a first-class honours MSci in Mathematics and Computer Science, winning the prize for Best graduating student in Computer Science and Mathematics. Tom was a lecturer for Introduction to Computer Architecture and taught Introduction to High Performance Computing at the University of Bristol, and has given tutorials on parallel programming models at international conferences, as well as private courses on OpenCL and OpenMP. Training material on OpenMP for Computational Scientists and HandsOnOpenCL can be found online.
PhD in High Performance Computing, 2018
University of Bristol
Postgraduate Artists Diploma in Trumpet Performance, 2014
Trinity Laban Conservatoire of Music and Dance
MSci in Mathematics and Computer Science, 2012
University of Bristol
The phrase “performance portability” is commonly used, but may mean different things to different people. Developing a better appreciation of the needs of different software developers and a framework for talking about these needs improves our ability to define goals, design experiments and make forward progress. This article discusses a methodology for quantifying, summarizing, visualizing, and understanding application performance portability and programmer productivity.
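The abstract above does not state a formula. As an illustration only, one metric widely used in this line of work (proposed by Pennycook, Sewall and Lee) scores an application by the harmonic mean of its efficiency on each platform in a chosen set, and gives a score of zero if any platform in the set is unsupported. The sketch below assumes that definition; the function name and the platform names in the usage note are hypothetical.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies.

    `efficiencies` maps a platform name to an efficiency in (0, 1].
    A value of None (or a non-positive number) marks a platform the
    application does not run on, which makes the overall score 0.
    This is a sketch of one commonly used metric, not necessarily
    the exact method of the article summarized above.
    """
    if not efficiencies:
        raise ValueError("need at least one platform")
    values = list(efficiencies.values())
    if any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)
```

For example, `performance_portability({"cpu_a": 0.5, "gpu_b": 1.0})` gives two thirds: the harmonic mean penalizes the weaker platform more heavily than an arithmetic mean would, and a single unsupported platform drives the score to zero.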
In recent years the computer processors underpinning the large, distributed, workhorse computers used to solve the Boltzmann transport equation have become ever more parallel and diverse. Traditional CPU architectures have increased in core count, reduced in clock speed and gained a deep memory hierarchy. Multiple processor vendors offer a collectively diverse range of both CPUs and GPUs, and the fastest machines in the world use an ever more diverse set of many-core architectures. Going forward, the landscape of processor technologies will require our codes to function well across multiple architectures. This ever-increasing range of architectures represents a unique challenge for solving the Boltzmann equation using deterministic methods in particular, and so it is important to characterize the performance of those key algorithms across the processor spectrum. The solution of the transport equation is computationally expensive, and so we require well-optimized and highly parallel solver implementations in order to solve interesting problems quickly. In this work we explore the performance profiles of deterministic SN transport sweeps for both 3D structured (Cartesian) and unstructured (hexahedral) meshes. The study focuses on the computational characteristics that determine the actual performance of a transport solver.
Many scientific codes consist of memory-bandwidth-bound kernels. One major advantage of many-core devices such as general purpose graphics processing units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. Peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance. We augment the standard STREAM kernels with a dot product kernel to investigate the performance of simple reduction operations on large arrays. The choice of programming model should ideally not limit the achievable performance on a device. BabelStream (formerly GPU-STREAM) has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such, this tool can be used as a kind of Rosetta Stone, providing an array of achievable memory bandwidth results across both platforms and programming models.
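To make the kernels concrete, here is a minimal sketch of the four standard STREAM operations plus a dot product, together with the usual byte-counting convention for turning a measured time into a bandwidth figure. It is written in plain Python for readability, so its timings reflect interpreter overhead and not memory bandwidth; it is not BabelStream's implementation. The scalar, the initial array values, and all function names are illustrative assumptions.

```python
import time
from array import array

SCALAR = 0.4  # illustrative value, chosen for this sketch

def copy(a, c):
    for i in range(len(a)):
        c[i] = a[i]

def mul(b, c):
    for i in range(len(b)):
        b[i] = SCALAR * c[i]

def add(a, b, c):
    for i in range(len(c)):
        c[i] = a[i] + b[i]

def triad(a, b, c):
    for i in range(len(a)):
        a[i] = b[i] + SCALAR * c[i]

def dot(a, b):
    # Reduction over two arrays: the extra kernel added to STREAM.
    return sum(x * y for x, y in zip(a, b))

# Number of arrays each kernel reads or writes once per element.
ARRAYS_MOVED = {"copy": 2, "mul": 2, "add": 3, "triad": 3, "dot": 2}

def bandwidth_gbs(kernel, n, seconds, bytes_per_elem=8):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return ARRAYS_MOVED[kernel] * n * bytes_per_elem / seconds / 1e9

def timed(fn, *args):
    start = time.perf_counter()
    value = fn(*args)
    return value, time.perf_counter() - start

def run_once(n):
    """Run each kernel once on arrays of n doubles; return (dot, timings)."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    timings = {}
    _, timings["copy"] = timed(copy, a, c)
    _, timings["mul"] = timed(mul, b, c)
    _, timings["add"] = timed(add, a, b, c)
    _, timings["triad"] = timed(triad, a, b, c)
    total, timings["dot"] = timed(dot, a, b)
    return total, timings

if __name__ == "__main__":
    n = 100_000
    total, timings = run_once(n)
    for name, seconds in timings.items():
        if seconds > 0:
            print(f"{name:5s} {bandwidth_gbs(name, n, seconds):.4f} GB/s")
```

A real measurement would use arrays much larger than the last-level cache, repeat each kernel many times, and report the best time, since a single run on small arrays mostly measures cache and start-up effects.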