SC23 full-day tutorial: Hands-on Practical Hybrid Parallel Application Performance Engineering (Denver, CO, USA)

SC23 Program Entry Location: room 402, Colorado Convention Center

Date

Monday 13th November 2023

Presenters

Markus Geimer, Jülich Supercomputing Centre
Sameer Shende, University of Oregon
Bert Wesarg, Technische Universität Dresden
Brian Wylie, Jülich Supercomputing Centre

Logistics

This page will be updated as information becomes available, so check back before traveling to attend the tutorial. Tutorials are planned to be live-streamed as part of the SC23 Digital Experience; however, remote participants will not receive assistance with the hands-on parts. The currently available software and exercises are being updated in preparation for the tutorial.

The full-day hands-on tutorial takes place as part of the SC23 conference in room 402 of the Colorado Convention Center, Denver, CO, USA. Registration for the tutorial is possible via the conference website, with or without the conference technical program, exhibition and workshops.

Hands-on exercises will use accounts provided by Jülich Supercomputing Centre (JSC) on the JUWELS-Booster modular supercomputer. Participants will build and run an MPI+CUDA example code on two compute nodes, each with dual AMD EPYC 7402 24-core 'Rome' CPUs and quad Nvidia A100 'Ampere' GPUs, and will measure and analyse intra-node and inter-node performance with VI-HPS tools. Access will be via the Jupyter-JSC service, which allows an Xpra remote graphical desktop environment to run within common web browsers. Tutorial participants are expected to use their own notebook computers, connecting via the SC conference wireless network; no additional software needs to be installed.

Tutorial participants are strongly encouraged to (pre)register for a JUDOOR account to access the training project and its allocation on JUWELS-Booster.
(Note that the SC23 tutorial on Distributed GPU Programming, which will also use this system, is scheduled to run concurrently and will use a different training project.)

Abstract

This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the community-developed Score-P instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid MPI+OpenMP, and the increasingly common use of accelerators. Parallel performance evaluation tools from the VI-HPS (Virtual Institute - High Productivity Supercomputing) are introduced and featured in hands-on exercises with Scalasca, Vampir and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Using their own notebook computers, participants will conduct exercises on quad-A100 GPU nodes of the JUWELS-Booster modular supercomputer. This will help to prepare participants to locate and diagnose performance bottlenecks in their own parallel programs.

Programme (tentative)

08:30	Introduction & basic measurement
	[15] Welcome & Introduction to VI-HPS [Wylie]
	[30] Introduction to parallel application engineering [Geimer]
	[15] Setup for hands-on exercises with Jupyter-JSC & JUWELS-Booster [Wylie]
	[30] Instrumentation & measurement of applications with Score-P [Wesarg]
10:00	(break)
10:30	Profile analyses
	[30] Exploration & visualization of call-path profiles with CUBE [Wylie]
	[30] Configuration & customization of Score-P measurements [Wesarg]
	[30] Examination & visualization of profiles with TAU [Shende]
12:00	(lunch)
13:30	Trace analyses
	[15] Recap of exercise setup and collection of traces with Score-P [Wylie]
	[45] Interactive visualization and time-interval statistics with Vampir [Wesarg]
	[30] Automated analysis of traces for inefficiencies with Scalasca [Geimer]
15:00	(break)
15:30	Further steps
	[15] Performance data management with TAU PerfExplorer [Shende]
	[30] Specialized Score-P measurements and analyses [Wesarg]
	[30] Finding typical parallel performance bottlenecks [Wesarg]
	[15] Review & conclusion [Geimer]
17:00	(adjourn)