
Note: All times are shown in the time zone in which the event takes place.

Date: April 8, 2026, 13:00 - 14:00

Time zone: Eastern Daylight Time (North America)

Language of instruction: English

Topic: "Mixing diverse workloads within a single job: a heterogeneous scheduling primer"

Speaker: Sergey Mashchenko, SHARCNET

Registration link

Recording

--- 

Most users of the National systems are perfectly happy with our "vanilla" job scheduling setup: one requests a particular type of resource (CPU cores, GPUs or MIGs, large memory) and an amount, and then lets the job consume those resources for its specified runtime. In most cases this results in reasonably good job efficiency and a reasonably short queue wait time. However, a sizeable minority of jobs do not fit this homogeneous job model well: because of the diverse workloads within the job, they end up wasting resources and waiting a long time in the queue. In more extreme cases, some of these jobs cannot run on our clusters at all.

One example is a large MPI job (say, 1000 CPU cores) where the typical memory requirement per rank is modest (say, 4 GB), but rank 0 frequently needs much more (say, 12 GB). With the simple (homogeneous) job model, one would have to request the larger amount of memory per CPU core (12 GB) for all ranks, which results in a long queue wait time and wasted memory or CPU cores. Another example is a job in which one component needs GPUs while the other does not (it needs lots of CPU cores instead). Again, fitting such a workload into a homogeneous job results in a longer wait time and a significant waste of resources (most likely GPUs).
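To put rough numbers on the MPI example, the following shell sketch compares the memory a homogeneous request would reserve with the memory the ranks actually need. It uses only the illustrative figures quoted in the abstract (1000 ranks, 4 GB typical, 12 GB for rank 0) and assumes one CPU core per rank; the variable names are made up for this illustration.

```shell
#!/bin/bash
# Illustrative figures from the abstract; one CPU core per rank is assumed.
ranks=1000       # total MPI ranks
typical_gb=4     # memory needed by each rank other than rank 0
rank0_gb=12      # memory needed by rank 0

# Homogeneous job: every rank has to request the rank-0 amount.
homogeneous_gb=$(( ranks * rank0_gb ))

# Memory actually needed: rank 0 plus the remaining ranks.
needed_gb=$(( rank0_gb + (ranks - 1) * typical_gb ))

unused_gb=$(( homogeneous_gb - needed_gb ))
unused_pct=$(( 100 * unused_gb / homogeneous_gb ))   # integer division

echo "requested (homogeneous): ${homogeneous_gb} GB"
echo "actually needed:         ${needed_gb} GB"
echo "reserved but unused:     ${unused_gb} GB (about ${unused_pct}%)"
```

With these figures the homogeneous request reserves 12000 GB while only 4008 GB is needed, so roughly two thirds of the requested memory would sit idle.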

The proper way to handle diverse workloads within a single job is to use the heterogeneous job scheduling mode of SLURM (our scheduler). This scheduler feature has been available for some time, but it was not widely advertised to our users while we were testing it. In this webinar I will walk you through the whys and hows of heterogeneous job scheduling and provide some specific job script examples. The webinar has no prerequisites, but some experience with job submission on our clusters would be a plus.
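As a rough illustration of what such a job script can look like (not one of the webinar's own examples), here is a minimal sketch for the MPI case described in the abstract. It relies on the `#SBATCH hetjob` separator and the `srun --het-group` option from Slurm's heterogeneous job support. The executable name `./my_mpi_app` is a hypothetical placeholder, and site-specific options such as the account and time limit are left out. It also assumes that the installed MPI library supports a single job step spanning both components, and that rank 0 is placed in component 0.

```shell
#!/bin/bash
# Sketch of a heterogeneous Slurm job; ./my_mpi_app is a hypothetical name.

# Component 0: a single task with the larger memory request (intended for rank 0).
#SBATCH --ntasks=1 --mem-per-cpu=12G
#SBATCH hetjob
# Component 1: the remaining 999 tasks with the modest memory request.
#SBATCH --ntasks=999 --mem-per-cpu=4G

# Launch one job step across both components (het groups 0 and 1).
srun --het-group=0,1 ./my_mpi_app
```

The script would be submitted with `sbatch` like any other job script. Whether the two components share one MPI communicator depends on the cluster's Slurm and MPI configuration, so it is worth checking on a small test job first.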

Webinar registration is required. Need help attending a webinar? See the SHARCNET Help Wiki.

Keywords: Programming, Parallel, HPC

