Not all computational problems call for the kinds of parallel applications traditionally designed to run on high-performance computing (HPC) systems. Today, many workloads running on these systems require only a modest amount of computing resources for any given job or task. For some research workloads, however, the more important consideration is how much aggregate compute power can be consistently and reliably applied to a problem over time. These high-throughput computing (HTC) workloads aim to solve larger problems over extended periods by completing numerous smaller computational subtasks. Examples include large parameter sweeps over simulation inputs and the regular processing and analysis of data collected from specialized instruments. In some cases, these problems are also composed of numerous distinct computational subtasks linked together in highly structured, complex workflows, which can themselves be challenging to design and manage effectively. If your research problem can leverage a high-throughput or many-task computing (MTC) model, then learning how to build and run these types of workflows safely and effectively on HPC systems is vital.
In this third part of our series on Batch Computing, we introduce you to high-throughput and many-task computing using the Slurm Workload Manager. In particular, you will learn how to use Slurm job arrays and job dependencies to create these more structured computational workflows. We will also highlight some problems you’ll likely encounter when you start running HTC and/or MTC workloads on HPC systems, including a discussion of job bundling strategies: what they are and when to use them. Additional topics in high-throughput and many-task computing workflows will be covered as time permits.
---
COMPLECS (COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure) is a new SDSC training program covering the non-programming skills needed to use supercomputers effectively. Topics include parallel computing concepts, Linux tools and bash scripting, security, batch computing, how to get help, data management, and interactive computing. Each session offers 1 hour of instruction followed by a 30-minute Q&A. COMPLECS is supported by NSF award 2320934.