Slurm scheduler overview
Overview of the LOTUS batch scheduler, Slurm
A job scheduler, or batch scheduler, is a tool that manages how user jobs are queued and run on a set of compute resources. In the case of LOTUS, the compute resources are the set of compute nodes that make up the LOTUS hardware. Each user can submit jobs to the scheduler, which then decides which jobs to run and where to execute them. The scheduler manages the jobs to ensure that the compute resources are used efficiently and that users get appropriate access to those resources.
Slurm is the job scheduler deployed on JASMIN. It allows users to submit, monitor, and control jobs on the LOTUS cluster.
Before learning how to use Slurm, it is worth becoming familiar with the basic principles of scheduler operation in order to get the best use out of the LOTUS cluster. Scheduler software exists simply because the number of jobs that users wish to run on a cluster at any given time usually far exceeds the resources available. This means that the scheduler must queue jobs and work out how to run them efficiently.
Several factors are taken into account during scheduling, such as job duration and size, but the basic principles remain the same throughout - every user gets a fair share of the cluster based on the jobs that they have submitted. This leads to a small number of important principles:
In the example above, three users (left column) have jobs in the queue (middle column) which are waiting to run on the cluster (right column). As the blue user’s job finishes (middle row), all three users could potentially use the two job slots that become available. However, the orange and purple users already have jobs running, whereas the blue user does not, and as such it is the blue user’s jobs that are run (bottom row).
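The fair-share idea in this example can be illustrated with a minimal Python sketch. It is a deliberate simplification and not Slurm's actual priority calculation, which weighs several factors. The function name, the job labels and the running-job counts (two each for orange and purple, none for blue) are assumptions made for illustration only.

```python
from collections import deque


def pick_next_jobs(running, queued, free_slots):
    """Fill free slots one at a time, favouring the user with the fewest running jobs."""
    running = dict(running)  # copy so the caller's counts are not modified
    queues = {user: deque(jobs) for user, jobs in queued.items()}
    started = []
    for _ in range(free_slots):
        # Only users who still have queued jobs can be chosen
        candidates = [user for user, jobs in queues.items() if jobs]
        if not candidates:
            break
        # Fewest running jobs wins; ties are broken by user name for determinism
        user = min(candidates, key=lambda u: (running.get(u, 0), u))
        started.append((user, queues[user].popleft()))
        running[user] = running.get(user, 0) + 1
    return started


# Assumed state after the blue user's job has finished (hypothetical numbers)
running = {"orange": 2, "purple": 2, "blue": 0}
queued = {"orange": ["o3"], "purple": ["p3"], "blue": ["b2", "b3"]}

started = pick_next_jobs(running, queued, free_slots=2)
print(started)
```

Under these assumed counts both free slots go to the blue user, which matches the outcome described in the example.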
There are five standard Slurm queues (also known as “partitions” in Slurm terminology) for batch job submissions to the LOTUS cluster: short-serial, long-serial, par-single, par-multi and high-mem. The default queue is short-serial. For testing new workflows, the additional queue test is recommended. The specification of each queue is described in detail in this article: Slurm queues on LOTUS
Queues other than the five standard queues and the test queue should be ignored unless you have been specifically instructed to use them.
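As a hedged illustration of choosing a queue, the Python sketch below builds an sbatch command line that selects a partition with the standard --partition option and rejects any queue name not listed above. The helper name sbatch_command and the script name myjob.sh are hypothetical; the sketch only constructs the command and does not submit anything.

```python
import shlex

# Queue names taken from the list above
STANDARD_QUEUES = {"short-serial", "long-serial", "par-single", "par-multi", "high-mem"}
TEST_QUEUE = "test"


def sbatch_command(script, queue="short-serial"):
    """Return the sbatch argument list for submitting `script` to `queue`."""
    if queue not in STANDARD_QUEUES and queue != TEST_QUEUE:
        raise ValueError(f"unexpected queue {queue!r}: use a standard queue or the test queue")
    return ["sbatch", f"--partition={queue}", script]


# Hypothetical job script, sent to the test queue while trying out a new workflow
command = sbatch_command("myjob.sh", queue="test")
print(shlex.join(command))
```

On a LOTUS login host, the resulting list could be passed to subprocess.run to submit the job. A project-specific queue would need to be added to the allowed names explicitly.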
One of the great advantages of using JASMIN is the ability to create batch jobs that run simultaneously on multiple LOTUS nodes. However, users familiar with running interactively on a single machine often take time to adapt to this new way of working. The change involves moving from a “watching your job run” approach to “submitting your job and coming back later”.
The typical workflow for setting up and running LOTUS jobs is as follows:
Occasionally a project has a specific requirement for a collection of compute nodes that involves the provision of a project-specific queue. If you are working on such a project, your project lead will provide guidance on which queue to use. Please contact the helpdesk if you are interested in setting up a project-specific queue.