This article introduces the LSF batch queues on LOTUS. It covers:
- Queue name
- Queue details
- Queue priority
- How to use serial queues
- How to use parallel queues
LOTUS has five public queues: short-serial, long-serial, high-mem, par-single and par-multi.
Each queue has its own run time limits (e.g. short, long) and resource limits. A full breakdown of each queue and its associated resources is shown below in Table 1.
Note 1: For testing new workflows, and for new JASMIN users, the testing queue test should be used as follows:
bsub -q test
Note 2: The default queue is short-serial.
A queue represents a set of pending jobs, lined up in a defined order and waiting for their opportunity to use resources. Jobs must be submitted to a queue using the bsub -q <queue_name> command, where <queue_name> is the name of the queue (Table 1, column 1).
Table 1 summarises important specifications for each queue, such as the run time limits and the limits on the number of CPU cores. If no queue is selected, LSF schedules the job to the short-serial queue by default.
Table 1. LOTUS queues and their specifications
| Queue name | Max run time | Default run time | Max cores per job | Max cores per user | Priority |
Note 1: Resources that the job requests must be within the resource allocation limits of the selected queue.
Note 2: Any job requiring more than 4 GB of RAM (the memory per core of the lowest-specification host type in LOTUS) must specify the memory needed with the -R flag. See the note on how to estimate and allocate resources for jobs.
Note 3: The default value of the -W option (predicted wall time) is 1 hour for all six LSF queues. If you do not specify this option, the 1-hour default applies. A job that runs beyond its wall time, or beyond the maximum run time limit of its queue, is terminated by the LSF scheduler.
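The following bash sketch shows one way to build a -W value and check it against a queue's maximum run time before submitting. The function names max_hours_for_queue and walltime_flag are hypothetical helpers, not part of LSF. The limits are the ones stated in the text of this article; the 24-hour limit for short-serial is inferred from the guidance for that queue, and the parallel queues are omitted because their run time limits are not given here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: format a run time as an LSF -W value (HH:MM) and
# check it against the queue's maximum run time.

# Maximum run times in hours, as stated in this article.
# Assumption: short-serial is limited to 24 hrs.
max_hours_for_queue() {
    case "$1" in
        test)         echo 4 ;;
        short-serial) echo 24 ;;
        high-mem)     echo 48 ;;
        long-serial)  echo 168 ;;
        *)            return 1 ;;
    esac
}

# walltime_flag <queue> <minutes>
# Prints "-W HH:MM", or fails if the request exceeds the queue limit.
walltime_flag() {
    local queue=$1 minutes=$2 max_hours
    max_hours=$(max_hours_for_queue "$queue") || {
        echo "unknown queue: $queue" >&2
        return 1
    }
    if [ "$minutes" -gt $((max_hours * 60)) ]; then
        echo "$minutes min exceeds the $max_hours hr limit of $queue" >&2
        return 1
    fi
    printf -- '-W %02d:%02d\n' $((minutes / 60)) $((minutes % 60))
}

# A 90-minute request for the short-serial queue.
walltime_flag short-serial 90
```

The printed value can be pasted into a bsub command line. Requesting a wall time close to the real run time, rather than relying on the 1-hour default, avoids a job being terminated early.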
Queue priority defines the order in which queues are searched to determine which job will be processed. Queues are assigned a priority by the LSF administrator, where a higher number has a higher priority. Queues are serviced by LSF in order of priority from the highest to the lowest.
Each of the queues listed above has been given a priority (Table 1, column 6) to ensure fair share scheduling. The shorter run time queues have a higher priority than the longer run time queues, so that shorter jobs are completed more quickly. For example, if one job is pending in the short-serial queue and another in the long-serial queue, the job in the short-serial queue will be scheduled to run first. Before submitting jobs, make sure the most appropriate queue is selected to prevent inefficient scheduling.
If multiple queues have the same priority, LSF schedules all the jobs from these queues in first-come, first-served order.
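As a small illustration of the ordering rule, the bash sketch below sorts queue and priority pairs from the highest priority to the lowest. The function name service_order is hypothetical, and only the two priorities stated in this article (30 for short-serial, 10 for long-serial) are used as sample data. It does not model the first-come, first-served rule for queues of equal priority.

```shell
#!/usr/bin/env bash
# service_order: read "queue priority" pairs on standard input and print
# them in the order LSF services the queues (highest priority first).
service_order() {
    sort -k2,2nr
}

# Sample data: the two priorities stated in this article.
printf '%s\n' 'long-serial 10' 'short-serial 30' | service_order
```

On a live system, the LSF bqueues command reports the configured priority of each queue.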
The test queue, test, can be used to test new workflows and to help new users familiarise themselves with the LSF batch system. Both serial and parallel code can be tested on the test queue. The maximum run time is 4 hrs and each user is limited to 8 job slots. A parallel job (e.g. MPI, OpenMP or multi-threaded) is limited to 8 cores. The test queue should be used when you are unsure of a job's resource requirements and runtime behaviour, because it has a confined set of LOTUS hosts that is not shared with the other standard LOTUS queues.
Serial and array jobs with a single CPU core should be submitted to one of the following serial queues, depending on the job duration and the memory requirement. The default queue is short-serial.
Serial or array jobs with a single CPU core and a run time of less than 24 hrs should be submitted to the short-serial queue. This queue has the highest priority of 30. Each user can have up to 2000 jobs running in the short-serial queue, as long as each job's resources are within the resource limits of this queue. An example is shown below:
$ bsub -q short-serial -W 00:05 -o %J.out -e %J.err /bin/hostname
Job <2170892> is submitted to queue <short-serial>.
$ bjobs
JOBID    USER    STAT  QUEUE       FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
2170892  fchami  RUN   short-seri  jasmin-sci1  host171.jc.  */hostname  Oct 12 18:55
Note that to display job information without truncating fields, use the wide-format option of the command: bjobs -w
Serial or array jobs with a single CPU core and a run time greater than 24 hrs and less than 168 hrs (7 days) should be submitted to the long-serial queue. This queue has the lowest priority of 10, so its jobs might take longer to be scheduled than jobs in higher-priority queues.
$ bsub -q long-serial -o %J.out -e %J.err /bin/hostname
Job <2171658> is submitted to queue <long-serial>.
$ bjobs
JOBID    USER    STAT  QUEUE       FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
2171658  fchami  RUN   long-seria  jasmin-sci1  host073.jc.  */hostname  Oct 12 19:06
Serial or array jobs with a single CPU core and a high memory requirement (> 64 GB) should be submitted to the high-mem queue, and the job should not exceed the maximum run time limit of 48 hrs. This queue is not configured to accept exclusive jobs.
$ bsub -q high-mem -o %J.out -e %J.err /bin/hostname
Job <3531310> is submitted to queue <high-mem>.
$ bjobs
JOBID    USER    STAT  QUEUE     FROM_HOST    EXEC_HOST   JOB_NAME    SUBMIT_TIME
3387435  fchami  PEND  high-mem  jasmin-sci1  host291.jc  */hostname  Oct 19 16:38
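The guidance for the three serial queues can be summarised as a small decision function. The bash sketch below is illustrative only: choose_serial_queue is a hypothetical helper, it takes whole hours and whole GB, and it assumes that a run time of exactly 24 hrs still fits in short-serial.

```shell
#!/usr/bin/env bash
# choose_serial_queue <run_time_hours> <memory_gb>
# Prints the serial queue suggested by the guidance in this article.
choose_serial_queue() {
    local hours=$1 mem_gb=$2
    if [ "$mem_gb" -gt 64 ]; then
        # High-memory jobs go to high-mem, which allows at most 48 hrs.
        if [ "$hours" -gt 48 ]; then
            echo "high-mem jobs are limited to 48 hrs" >&2
            return 1
        fi
        echo high-mem
    elif [ "$hours" -le 24 ]; then
        # Assumption: exactly 24 hrs is still accepted by short-serial.
        echo short-serial
    elif [ "$hours" -le 168 ]; then
        echo long-serial
    else
        echo "no serial queue allows more than 168 hrs" >&2
        return 1
    fi
}

choose_serial_queue 2 4      # a short, low-memory job
choose_serial_queue 72 4     # a three-day job
choose_serial_queue 12 128   # a high-memory job
```

The printed name can then be passed to bsub -q.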
Jobs requiring more than one CPU core should be submitted to one of the following parallel queues, depending on the type of parallelism (shared memory or distributed memory).
Shared memory multi-threaded jobs with a maximum of 16 threads should be submitted to the par-single queue. Each thread should be allocated one CPU core. Running more threads than allocated CPU cores (oversubscription) will cause the job to run very slowly. The number of CPU cores should be specified via the LSF submission command flag bsub -n <number of CPU cores> or by adding the LSF directive #BSUB -n <number of CPU cores> to the job script file. An example is shown below:
$ bsub -q par-single -n 10 < singlenode.bsub
Job <2338714> is submitted to queue <par-single>.
$ bjobs
JOBID    USER    STAT  QUEUE       FROM_HOST    EXEC_HOST    JOB_NAME    SUBMIT_TIME
2338714  fchami  RUN   par-single  jasmin-sci1  10*host290.  singlenode  Oct 13 10:15
Note: Jobs submitted with more than 16 CPU cores will be terminated (killed) by the LSF scheduler, with the following statement in the job output file:
TERM_PROCESSLIMIT: job killed after reaching LSF process limit. Exited with exit code 254.
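A job script for the par-single queue could look like the one generated by the bash sketch below. The helper name make_par_single_script and the program name ./my_threaded_program are hypothetical. Setting OMP_NUM_THREADS assumes an OpenMP program; other threading libraries set their thread count differently. The helper rejects requests above the 16-core limit before anything is submitted.

```shell
#!/usr/bin/env bash
# make_par_single_script <threads> <program>
# Prints a job script that requests one CPU core per thread.
make_par_single_script() {
    local threads=$1 program=$2
    if [ "$threads" -lt 1 ] || [ "$threads" -gt 16 ]; then
        echo "par-single accepts 1 to 16 cores, got $threads" >&2
        return 1
    fi
    cat <<EOF
#!/bin/bash
#BSUB -q par-single
#BSUB -n $threads
#BSUB -o %J.out
#BSUB -e %J.err
# One thread per allocated core, so the cores are not oversubscribed.
# Assumption: the program is an OpenMP code.
export OMP_NUM_THREADS=$threads
$program
EOF
}

# Print a 10-thread job script for a hypothetical program.
make_par_single_script 10 ./my_threaded_program
```

Redirecting the output to a file such as singlenode.bsub gives a script that can be submitted with bsub < singlenode.bsub. The #BSUB lines then carry the queue and core count, so the flags do not need to be repeated on the command line.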
Distributed memory jobs with inter-node communication using the MPI library should be submitted to the par-multi queue. Each MPI process (rank) should be allocated a single CPU core. The number of CPU cores should be specified via the LSF submission command flag bsub -n <number of CPU cores> or by adding the LSF directive #BSUB -n <number of CPU cores> to the job script file. An example is shown below:
$ bsub -x -q par-multi -n 24 < multinodes.bsub
Job <2338707> is submitted to queue <par-multi>.
$ bjobs
JOBID    USER    STAT  QUEUE      FROM_HOST    EXEC_HOST              JOB_NAME    SUBMIT_TIME
2338707  fchami  RUN   par-multi  jasmin-sci1  16*host285.j           multinodes  Oct 13 10:06
                                               8*host282.jc.rl.ac.uk
Note 1: The number of CPU cores is passed to the job from the LSF submission flag -n. Do not add the -np flag to the MPI launch command.
Note 2: Adding the -x option to the bsub command puts the host running your job into exclusive execution mode, so the host is not shared with other jobs. This is recommended for very large memory jobs or parallel MPI jobs only.
Note 3: LSF will terminate a job that requires a number of CPU cores greater than the limit of 256.
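Because jobs that exceed a queue's core limit are terminated, it can help to check the request before submitting. The bash sketch below uses two hypothetical helpers, max_cores and parallel_bsub_command, with the per-job core limits stated in this article (8 for test, 16 for par-single, 256 for par-multi). It only prints the bsub command line and does not submit anything.

```shell
#!/usr/bin/env bash
# max_cores <queue>: per-job core limits stated in this article.
max_cores() {
    case "$1" in
        test)       echo 8 ;;
        par-single) echo 16 ;;
        par-multi)  echo 256 ;;
        *)          return 1 ;;
    esac
}

# parallel_bsub_command <queue> <cores> <job_script>
# Prints the bsub command line if the request is within the queue limit.
parallel_bsub_command() {
    local queue=$1 cores=$2 script=$3 limit
    limit=$(max_cores "$queue") || {
        echo "unknown queue: $queue" >&2
        return 1
    }
    if [ "$cores" -gt "$limit" ]; then
        echo "$queue allows at most $limit cores per job, got $cores" >&2
        return 1
    fi
    echo "bsub -q $queue -n $cores < $script"
}

# A 24-core MPI job, as in the par-multi example above.
parallel_bsub_command par-multi 24 multinodes.bsub
```

The -x flag is left out of the generated command on purpose, since exclusive mode is recommended only for very large memory jobs or parallel MPI jobs.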
Reservation of resources
It is possible to make resources available for certain use cases by setting a reservation code for a given set of resources (compute time, number of CPUs and memory). Please contact CEDA support to enquire about the criteria for requesting a reservation of resources.