SLURM queues

This article introduces the SLURM scheduler queues/partitions for batch job submissions to LOTUS. It covers:

  • Queue name
  • Queue details
  • How to view the state of SLURM queues
  • How to choose a queue

Queue name 

The LOTUS cluster is managed by SLURM and provides the following queues:

  • test
  • short-serial
  • long-serial
  • par-single
  • par-multi
  • high-mem
  • short-serial-4hr (see Note 3)

Each queue is characterised by its run-time limits (e.g. short, long) and the resources available to it. A full breakdown of each queue and its associated resources is given in Table 1 below. 

Queue details

A queue represents a set of pending jobs, lined up in a defined order and waiting for their opportunity to use resources. The queue is specified in the job script file using the SLURM scheduler directive #SBATCH -p <queue_name>, where <queue_name> is the name of the queue/partition (Table 1, column 1).
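
For example, a minimal job script selecting a queue explicitly could look like the sketch below (the job name, output file names and command are illustrative):

#!/bin/bash
#SBATCH -p short-serial            # queue/partition name (Table 1, column 1)
#SBATCH --job-name=example-job     # illustrative job name
#SBATCH -o %j.out                  # standard output file (%j expands to the job ID)
#SBATCH -e %j.err                  # standard error file
#SBATCH --time=02:00:00            # predicted maximum wall time (see Note 2)

./my_program                       # illustrative command to run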

Table 1 summarises the key specifications of each queue, such as the run time limits and CPU core limits. If no queue is specified, SLURM schedules the job to the short-serial queue by default.

Table 1. LOTUS/SLURM queues and their specifications 

Queue name                 Max run time  Default run time  Max CPU cores per job  MaxCpuPerUserLimit  Priority
test                       4 hrs         1 hr              8                      8                   30
short-serial               24 hrs        1 hr              1                      2000                30
par-single                 48 hrs        1 hr              16                     300                 25
par-multi                  48 hrs        1 hr              256                    300                 20
long-serial                168 hrs       1 hr              1                      300                 10
high-mem                   48 hrs        1 hr              1                      75                  30
short-serial-4hr (Note 3)  4 hrs         1 hr              1                      1000                30

Note 1: Resources that the job requests must be within the resource allocation limits of the selected queue. 

Note 2: The default value of --time=[hh:mm:ss] (predicted maximum wall time) is 1 hour for all of the SLURM queues listed in Table 1. If your job exceeds its specified (or default) maximum run time limit, it will be terminated by the SLURM scheduler. 

Note 3: A user must specify the SLURM job account --account=short4hr when submitting a batch job to the provisional SLURM partition short-serial-4hr.
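
For example, a submission to this provisional partition might look like the following sketch (myjobscript is a placeholder job script):

$ sbatch --partition=short-serial-4hr --account=short4hr --time=03:00:00 < myjobscript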

How to view the state of SLURM queues

The SLURM command sinfo reports the state of queues/partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

$ sinfo   
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
test             up    4:00:00     48   idle host[146-193]
short-serial*    up 1-00:00:00     48   idle host[146-193]
long-serial      up 7-00:00:00     48   idle host[146-193]
par-single       up 2-00:00:00     48   idle host[146-193]
par-multi        up 2-00:00:00     48   idle host[146-193]
high-mem         up 2-00:00:00     48   idle host[146-193]
lotus_gpu        up 7-00:00:00     48   idle host[146-193]
copy             up 7-00:00:00     48   idle host[146-193]
cpom-comet       up 7-00:00:00     48   idle host[146-193]
...

Note: Queues other than the standard queues test, short-serial, long-serial, par-single, par-multi and high-mem should be ignored, as they implement different job scheduling and control policies. 

'sinfo' Output field description:

By default, the SLURM command 'sinfo' displays the following information:

  • PARTITION:  Partition name followed by "*" for the default queue/partition
  • AVAIL:  State/availability of a queue/partition. Partition state: up or down.
  • TIMELIMIT: The maximum run time limit per job in each queue/partition, shown in days-hours:minutes:seconds, e.g. 2-00:00:00 is a two-day maximum run time limit
  • NODES: Count of nodes with this particular configuration e.g. 48 nodes
  • STATE: State of the nodes. Possible states include: allocated, down, drained, and idle. For example, the state "idle" means that the node is not allocated to any jobs and is available for use.
  • NODELIST: List of node names associated with this queue/partition

The 'sinfo' example below reports more complete information about the partition/queue short-serial:

$ sinfo --long --partition=short-serial
Tue May 12 18:04:54 2020
PARTITION    AVAIL TIMELIMIT JOB_SIZE  ROOT  OVERSUBS  GROUPS NODES    STATE NODELIST
short-serial* up  1-00:00:00  1-infinite  no  NO    all     48  idle host[146-193]

How to choose a SLURM queue/partition

Test queue

The test queue can be used to test new workflows and to help new users familiarise themselves with the SLURM batch system. Both serial and parallel code can be tested on the test queue. The maximum run time is 4 hrs and the maximum number of jobs per user is 8 job slots. The maximum number of cores for a parallel job (e.g. MPI, OpenMP, or multi-threaded) is limited to 8 cores. The test queue should be used when unsure of the job resource requirements and behaviour at runtime, because it has a confined set of LOTUS nodes (Intel node type) not shared with the other standard LOTUS queues. 
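
For example, a small parallel test job could be submitted along the following lines (the core count, wall time and job script name are illustrative):

$ sbatch --partition=test --ntasks=4 --time=00:30:00 < myjobscript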

Serial queues

Serial and array jobs with a single CPU core should be submitted to one of the following serial queues, depending on the job duration and the memory requirement. The default queue is short-serial.

short-serial

Serial and/or array jobs with a single CPU core each and a run time of less than 24 hrs should be submitted to the short-serial queue. This queue has the highest priority of 30. Up to 2000 jobs per user can be scheduled to start running from short-serial at any one time, provided the requested resources are available and the user's priority is sufficiently high.
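
For example, a single-core job with a 12-hour wall time could be submitted as in the sketch below (the wall time and job script name are illustrative):

$ sbatch --partition=short-serial --time=12:00:00 < myjobscript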

long-serial

Serial or array jobs with a single CPU core and a run time greater than 24 hrs but less than 168 hrs (7 days) should be submitted to the long-serial queue. This queue has the lowest priority of 10, hence jobs might take longer to be scheduled to run relative to jobs in higher-priority queues.
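
For example, a single-core job expected to run for about 3 days could be submitted as follows (the wall time and job script name are illustrative):

$ sbatch --partition=long-serial --time=72:00:00 < myjobscript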

high-mem 

Serial or array jobs with a single CPU core and a high memory requirement (> 64 GB) should be submitted to the high-mem queue, and the required memory must be specified with --mem=XXX (XXX is in MB). The job should not exceed the maximum run time limit of 48 hrs. This queue is not configured to accept exclusive jobs.
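
For example, a job requiring around 100 GB of memory could be submitted as in the sketch below (the memory request, wall time and job script name are illustrative; --mem is given in MB):

$ sbatch --partition=high-mem --mem=100000 --time=24:00:00 < myjobscript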

Parallel queues

Jobs requiring more than one CPU core should be submitted to one of the following parallel queues, depending on the type of parallelism: shared memory (e.g. OpenMP/multi-threaded) or distributed memory (e.g. MPI).

par-single

Shared memory multi-threaded jobs with a maximum of 16 threads should be submitted to the par-single queue. Each thread should be allocated one CPU core. Oversubscribing the number of threads relative to the number of CPU cores will cause the job to run very slowly. The number of CPU cores should be specified via the submission command line sbatch -n <number of CPU cores> or by adding the SLURM directive #SBATCH -n <number of CPU cores> to the job script file. An example is shown below:

$ sbatch --ntasks=4 --partition=par-single < myjobscript
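
Equivalently, the resources can be requested inside the job script itself; the sketch below shows a 4-thread shared-memory job (the thread count, wall time and program name are illustrative):

#!/bin/bash
#SBATCH --partition=par-single
#SBATCH --ntasks=4                 # one CPU core per thread
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=4           # match the number of threads to the cores requested
./my_threaded_program              # illustrative multi-threaded executable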

Note: Jobs submitted with a number of CPU cores greater than 16 will be terminated (killed) by the SLURM scheduler, with a corresponding statement in the job output file.

par-multi

Distributed memory jobs with inter-node communication using the MPI library should be submitted to the par-multi queue. A single MPI process (rank) should be allocated a single CPU core. The number of CPU cores should be specified via the SLURM submission flag sbatch -n <number of CPU cores> or by adding the SLURM directive #SBATCH -n <number of CPU cores> to the job script file. An example is shown below:

$ sbatch --ntasks=4 --partition=par-multi < myjobscript

Note 1: The number of CPU cores is passed to MPI by SLURM from the submission flag -n; do not add the -np flag to the mpirun command.
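
For instance, a minimal par-multi job script could follow the sketch below (the rank count, wall time and program name are illustrative):

#!/bin/bash
#SBATCH --partition=par-multi
#SBATCH --ntasks=16                # one CPU core per MPI rank
#SBATCH --time=06:00:00

mpirun ./my_mpi_program            # no -np flag: mpirun takes the task count from SLURM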

Note 2: SLURM will reject a job that requests more CPU cores than the limit of 256. 
