Dr Tom Deakin is a Lecturer and Senior Research Associate in the High Performance Computing Research Group at the University of Bristol, led by Professor Simon McIntosh-Smith. Tom completed his PhD, Leveraging Many-Core Technology for Deterministic Neutral Particle Transport at Extreme Scale, in 2018, and has since continued working on unstructured-mesh transport and performance portability across HPC architectures. In 2012, Tom graduated top of his class from the University of Bristol with first-class honours in the MSci in Mathematics and Computer Science, winning the prize for best graduating student in Computer Science and Mathematics. Tom has taught on the Introduction to High Performance Computing course at the University of Bristol, and has helped deliver tutorials on parallel programming models at international conferences as well as private courses on OpenCL and OpenMP. His training material on OpenMP for Computational Scientists and HandsOnOpenCL can be found online.
PhD in High Performance Computing, 2018
University of Bristol
Postgraduate Artists Diploma in Trumpet Performance, 2014
Trinity Laban Conservatoire of Music and Dance
MSci in Mathematics and Computer Science, 2012
University of Bristol
In recent years the processors underpinning the large, distributed, workhorse computers used to solve the Boltzmann transport equation have become ever more parallel and diverse. Traditional CPU architectures have increased in core count, reduced in clock speed and gained a deep memory hierarchy. Multiple processor vendors offer a collectively diverse range of both CPUs and GPUs, and the fastest machines in the world now employ an ever wider variety of many-core architectures. Going forward, this landscape of processor technologies will require our codes to perform well across multiple architectures. Such diversity presents a particular challenge for solving the Boltzmann equation using deterministic methods, and so it is important to characterise the performance of the key algorithms across the processor spectrum. The solution of the transport equation is computationally expensive, and so well-optimised, highly parallel solver implementations are required in order to solve interesting problems quickly. In this work we explore the performance profiles of deterministic SN transport sweeps on both 3D structured (Cartesian) and unstructured (hexahedral) meshes. The study focuses on the computational characteristics which determine the achieved performance of a transport solver.
Previous studies into performance portability have typically analysed a single application (and its various implementations) in isolation. In this study we explore the wider landscape of performance portability by considering a number of applications from across the space of dwarfs, written in multiple parallel programming models, and run across a diverse set of architectures. We apply rigorous performance portability metrics, as defined by Pennycook et al. We believe this is the broadest and most rigorous performance portability study to date, representing a far-reaching exploration of the state of performance portability that is achievable today. We will present a summary of the performance portability of each application and programming model across our diverse range of twelve computer architectures, including six different server CPUs from five different vendors, five different GPUs from two different vendors, and one vector architecture. We will conclude with an analysis of the performance portability of key programming models in general, across different application spaces as well as across differing architectures, allowing us to comment on more general performance portability principles.
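For context, the Pennycook et al. metric mentioned above is commonly stated as the harmonic mean of an application's performance efficiency over a set of platforms; a sketch of the definition (notation as usually given, not reproduced from this study) is:

```latex
% Performance portability of application a solving problem p
% over a platform set H (Pennycook et al.):
\mathrm{PP}(a, p, H) =
  \begin{cases}
    \dfrac{|H|}{\sum_{i \in H} \dfrac{1}{e_i(a, p)}}
      & \text{if } a \text{ is supported on every platform } i \in H, \\[2ex]
    0 & \text{otherwise,}
  \end{cases}
```

where e_i(a, p) is the performance efficiency (either architectural efficiency or application efficiency) achieved on platform i. The harmonic mean means a single poorly performing platform drags the score down sharply, and any unsupported platform zeroes it.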
Many scientific codes consist of memory bandwidth bound kernels. One major advantage of many-core devices such as general purpose graphics processing units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. Peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance. We augment the standard STREAM kernels with a dot product kernel to investigate the performance of simple reduction operations on large arrays. The choice of programming model should ideally not limit the achievable performance on a device. BabelStream (formerly GPU-STREAM) has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such, this tool can be used as a kind of Rosetta Stone, providing a cross-platform and cross-programming-model comparison of achievable memory bandwidth.