Slurm queues
Slurm queues/partitions for batch job submissions to the LOTUS & ORCHID clusters
The Slurm queues in the LOTUS cluster are:
- standard
- debug
Each queue has attributes of run-length limits (e.g. short, long) and resources. A full breakdown of each queue and its associated resources, such as run time limits and memory limits, is shown below in Table 1.
Queues represent a set of pending jobs, lined up in a defined order, and waiting for their opportunity to use resources. The queue is specified in the job script file using a Slurm scheduler directive like this:
#SBATCH -p <queue_name>
where <queue_name> is the name of the queue/partition (Table 1, column 1).
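As an illustration, a minimal job script for the standard queue might begin like this (the job name and job body are hypothetical; the --qos and --account directives covered later on this page would normally be added as well):

```bash
#!/bin/bash
#SBATCH --job-name=partition-test   # illustrative job name
#SBATCH -p standard                 # queue/partition name from Table 1, column 1
#SBATCH -o %j.out                   # standard output file (%j expands to the job ID)
#SBATCH -e %j.err                   # standard error file

# Hypothetical workload
echo "Running on $(hostname)"
```

The script would then be submitted with sbatch <script_name>.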
Table 1: LOTUS/Slurm queues and their specifications
Queue name | Max run time | Default run time | Default memory per CPU |
---|---|---|---|
standard | 24 hrs | 1 hr | 1 GB |
debug | 1 hr | 30 mins | 1 GB |
Note 1: Resources requested by a job must be within the resource allocation limits of the selected queue.
Note 2: If your job exceeds its run time limit (the queue default if none is requested), it will be terminated by the Slurm scheduler.
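For example, a job aimed at the standard queue could request its run time and memory explicitly, keeping within the Table 1 limits (the values below are illustrative):

```bash
#SBATCH -p standard
#SBATCH --time=04:00:00        # request 4 hours, within the 24 hr maximum for standard
#SBATCH --mem-per-cpu=2G       # request 2 GB per CPU rather than relying on the 1 GB default
```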
The Slurm command sinfo reports the state of queues and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
...
standard* up 1-00:00:00 262 idle host[1004-1276]
debug* up 1:00:00 3 idle host[1001-1003]
...
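If only a few fields are of interest, sinfo's --format/-o option can be used to trim the output; for example (the field selection below is just one possibility):

```bash
# Partition name, availability, time limit, node count and node state
sinfo --partition=standard,debug -o "%P %a %l %D %t"
```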
Any other partitions that appear in the sinfo output, besides standard and debug, should be ignored as they implement different job scheduling and control policies.
sinfo output field description: by default, the Slurm command sinfo displays the following information:
- * for the default queue/partition.
- days-hours:minutes:seconds for the time limit, e.g. 2-00:00:00 is a two-day maximum runtime limit.
- idle means that the node is not allocated to any jobs and is available for use.

The sinfo example below reports more complete information about the partition/queue debug:
sinfo --long --partition=debug
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
debug up 1:00:00 1-infinite no NO all 3 idle host[1001-1003]
Queues/partitions are further divided into Quality of Service (QoS) levels, which determine further restrictions on your job, for example how long it can run or how many CPU cores it can use.
Different partitions on LOTUS have different allowed QoS, as shown below:
Partition | Allowed QoS |
---|---|
standard | standard, short, long, high |
debug | debug |
A summary of the different QoS is given below:
QoS | Priority | Max CPUs per job | Max wall time |
---|---|---|---|
standard | 500 | 1 | 24 hours |
short | 550 | 1 | 4 hours |
long | 350 | 1 | 5 days |
high | 450 | 96 | 2 days |
debug | 500 | 8 | 1 hour |
Once you have chosen the partition and QoS you need, provide the partition in the --partition directive and the QoS in the --qos directive in your job script, as in the sketch below.
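For example, a job script header selecting the standard partition with the short QoS might look like this (the job name, wall time and command are illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=my-analysis      # illustrative job name
#SBATCH --partition=standard        # partition chosen from the table above
#SBATCH --qos=short                 # a QoS allowed on the standard partition
#SBATCH --time=02:00:00             # within the 4 hour limit of the short QoS

./run_analysis.sh                   # hypothetical command
```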
The debug QoS can be used to test new workflows and also to help new users familiarise themselves with the Slurm batch system. This QoS should be used when you are unsure of a job's resource requirements and runtime behaviour, because it has a confined set of LOTUS nodes which is not shared with the other standard LOTUS queues.
QoS | Priority | Max CPUs per job | Max wall time | Max jobs per user |
---|---|---|---|---|
debug | 500 | 8 | 1 hour | 32 |
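For instance, a quick test run could be sent to the debug partition and QoS directly from the command line, overriding any values in the script (the script name is hypothetical):

```bash
sbatch --partition=debug --qos=debug --time=00:15:00 test_workflow.sh
```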
The standard QoS is the most common QoS to use, with a maximum of a single CPU per job and a runtime under 24 hours.
QoS | Priority | Max CPUs per job | Max wall time | Max jobs per user |
---|---|---|---|---|
standard | 500 | 1 | 24 hours | 4000 |
The short QoS is for shorter jobs (under 4 hours) and has a maximum of a single CPU per job.
QoS | Priority | Max CPUs per job | Max wall time | Max jobs per user |
---|---|---|---|---|
short | 550 | 1 | 4 hours | 2000 |
The long QoS is for jobs that will take longer than 24 hours, but it has a lower priority than standard. It also has a maximum of a single CPU per job.
QoS | Priority | Max CPUs per job | Max wall time | Max jobs per user |
---|---|---|---|---|
long | 350 | 1 | 5 days | 1350 |
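For example, a single-CPU job expected to run for about three days could request the long QoS like this (a sketch with illustrative values):

```bash
#SBATCH --partition=standard
#SBATCH --qos=long
#SBATCH --time=3-00:00:00       # three days, within the 5 day limit of the long QoS
```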
The high QoS is for jobs with larger resource requirements, for example more CPUs per job or more memory.
QoS | Priority | Max CPUs per job | Max wall time |
---|---|---|---|
high | 450 | 96 | 2 days |
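A multi-core job using the high QoS might be sketched as follows (the core count, memory and wall time are illustrative):

```bash
#SBATCH --partition=standard
#SBATCH --qos=high
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16      # up to 96 CPUs per job are allowed under the high QoS
#SBATCH --mem-per-cpu=2G        # more memory than the 1 GB per-CPU default
#SBATCH --time=1-00:00:00       # one day, within the 2 day limit
```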
Slurm accounting by project has been introduced as a means of monitoring compute usage by projects on JASMIN. These projects align with group workspaces (GWSs), and you will automatically be added to Slurm accounts corresponding to any GWS projects that you belong to.
To find out which Slurm accounts and quality of services (QoS) you have access to, use the useraccounts command on any sci machine.
Output should be similar to one or more of the lines below.
useraccounts
# sacctmgr show user fred withassoc format=user,account,qos%-50
User Account QOS
---------- -------------- -------------------------------------
fred mygws debug,high,long,short,standard
fred orchid debug,high,long,short,standard
You should use the relevant account for your project's task with the --account directive in your job script.
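For instance, using the account name shown in the example output above (mygws is just the illustrative name from that output):

```bash
#SBATCH --account=mygws         # Slurm account corresponding to your GWS project
#SBATCH --partition=standard
#SBATCH --qos=standard
```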
Users who do not belong to any group workspaces will be assigned the no-project account and should use that in their job submissions.
Please ignore and do not use the group shobu.