How to allocate job resources
This article explains how to allocate resources for batch computing on LOTUS. It covers:
- LOTUS queues
- Job duration
- Memory requirements
- Memory limit control
- High memory host selection
- Exclusive host use
- Number of CPU cores
- Shared scratch space for temporary job files
In a shared-resource environment it is essential to specify precisely how a batch job should run on LOTUS. You must therefore allocate resources such as the queue, the memory requirement, the job duration and the number of cores by adding specific options to the job submission command bsub, as detailed below.
LOTUS queues
All jobs wait in queues until they are scheduled and dispatched to hosts. The short-serial queue is the default queue and should be used for all serial jobs, unless a job has a memory requirement of over 512 GB, in which case the high-mem queue should be used. For example, to submit a job to a particular queue, identified by its queue name:
$ bsub -q short-serial < myjob
To view the available queues, run the following command:
$ bqueues
QUEUE_NAME      PRIO STATUS         MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
test              40 Open:Active      -    -    -    -     0     0     0     0
cpom-comet        35 Open:Active    128    -    -    -  1664  1536   128     0
rsg-general       35 Open:Active    482    -    -    -     6     0     6     0
rsgnrt            35 Open:Active     30    -    -    -    18     0    18     0
copy              30 Open:Active      -    -    -    -     0     0     0     0
sst_cci           30 Closed:Inact    96    -    -    -     0     0     0     0
ingest            30 Open:Active      -    -    -    -     1     0     1     0
short-serial      30 Open:Active   3000 2000    -    - 49163 46166  2997     0
high-mem          30 Open:Active     96   48    -    -     0     0     0     0
par-single        25 Open:Active    512  256    -    -    20     0    20     0
par-multi         20 Open:Active    512  256    -    -   404   320    84     0
long-serial       10 Open:Active    512  256    -    -    31     0    31     0
Queues other than the five public queues (short-serial, long-serial, par-single, par-multi and high-mem) should be ignored, as they implement different job scheduling and control policies. A queue can use all server hosts in the cluster or a configured subset of them.
Note: a queue's STATUS must be Open:Active (the queue is open and active) for a job to be dispatched.
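As a minimal sketch of the note above, the following filter prints only the queues that are currently Open:Active. The function name open_active_queues is made up for this illustration, and the sample input is a three-row excerpt of the bqueues listing above. It assumes STATUS is always the third whitespace-separated column.

```shell
# Print the names of queues whose STATUS column reads Open:Active.
# Reads bqueues-style text on stdin; the first line is the header.
open_active_queues() {
    awk 'NR > 1 && $3 == "Open:Active" { print $1 }'
}

# Sample input: header plus three rows taken from the listing above.
sample='QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
sst_cci 30 Closed:Inact 96 - - - 0 0 0 0
short-serial 30 Open:Active 3000 2000 - - 49163 46166 2997 0
high-mem 30 Open:Active 96 48 - - 0 0 0 0'

printf '%s\n' "$sample" | open_active_queues
```

On LOTUS the same filter could be fed from the live command, for example bqueues | open_active_queues.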
Job duration
The -W option sets the runtime limit of your job to a predicted time in hours and minutes; for example, -W 00:30 sets a limit of 30 minutes. If you do not specify the run time with -W, the default maximum of 1 hour applies.
Each queue has a specific maximum allowed job duration (see Table 1). Any job exceeding this limit will be aborted automatically, even if a longer duration is specified.
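If you estimate run times in minutes, a small helper can produce the hours-and-minutes form used with -W. This is only a sketch: to_wlimit is a hypothetical helper name, and the final line prints a command rather than submitting anything.

```shell
# Convert a duration in whole minutes to the hh:mm form used with -W.
to_wlimit() {
    minutes=$1
    printf '%02d:%02d\n' $((minutes / 60)) $((minutes % 60))
}

to_wlimit 30    # prints 00:30
to_wlimit 90    # prints 01:30

# Dry run: show the submission command without running it.
echo "bsub -q short-serial -W $(to_wlimit 90) < myjob"
```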
Specifying memory requirements
Any job requiring more than 4 GB of RAM (the memory per core on the lowest-specification host type in LOTUS) must specify the memory it needs with the -R option:
$ bsub -R "rusage[mem=XXX]"
where XXX is the memory size in MB.
Any jobs using extra memory that have been submitted without this flag may be killed by the service administrators if found to be adversely affecting the performance of other users' jobs.
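Because the value is given in MB, it is easy to slip by a factor of 1000 when thinking in GB. The sketch below builds the resource string from a GB figure. rusage_mem is a hypothetical helper, and it assumes this article's convention of 1 GB = 1000 MB (as in 8 GB = 8000 MB).

```shell
# Build the -R resource string for a memory request given in whole GB.
# Assumes 1 GB = 1000 MB, the convention used in this article.
rusage_mem() {
    echo "rusage[mem=$(( $1 * 1000 ))]"
}

# Dry run: print the submission command for a 6 GB job.
echo "bsub -R \"$(rusage_mem 6)\" < myjob"
```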
Memory limit control
The memory limit control is enforced on jobs submitted to long-serial queues. For a job whose allocated memory requirement is greater than 8 GB, the memory limit must be specified with -M; otherwise the default memory limit of 8 GB applies and the job will be terminated if it exceeds 8 GB.
Note that in the following command:
$ bsub -R "rusage[mem=XXX]" -M YYY
XXX is the memory requirement and YYY is the memory limit, both in units of MB.
- bsub -R "rusage[mem=XXX]" (no -M option): the default memory limit of 8000 MB (8 GB) is enforced. Example: bsub -R "rusage[mem=10000]"; this job will be killed when it exceeds 8000 MB (8 GB).
- bsub -R "rusage[mem=XXX]" -M YYY, if YYY < maxlimit = 64000 MB (64 GB): YYY is enforced. Example: bsub -R "rusage[mem=15000]" -M 15000; this job will be killed if it exceeds 15000 MB (15 GB).
- bsub -R "rusage[mem=XXX]" -M YYY, if YYY > maxlimit = 64000 MB (64 GB): the maxlimit of 64 GB is enforced.
See the bsub manual page for more information about the -R and -M options, including other select keywords.
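The rules in this section can be summarised as a small function that reports which limit would be enforced for a given -M value. This is an illustrative model of the behaviour described here, not a query against the scheduler. The names effective_limit, DEFAULT_LIMIT_MB and MAX_LIMIT_MB are made up for the sketch.

```shell
DEFAULT_LIMIT_MB=8000   # limit applied when -M is omitted
MAX_LIMIT_MB=64000      # maxlimit quoted in this section

# Print the memory limit (MB) that would be enforced for a given -M value.
# Call with no argument to model a job submitted without -M.
effective_limit() {
    if [ -z "${1:-}" ]; then
        echo "$DEFAULT_LIMIT_MB"
    elif [ "$1" -gt "$MAX_LIMIT_MB" ]; then
        echo "$MAX_LIMIT_MB"
    else
        echo "$1"
    fi
}

effective_limit          # prints 8000
effective_limit 15000    # prints 15000
effective_limit 100000   # prints 64000
```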
Selecting high-memory hosts
The second phase of LOTUS compute, added in spring/summer 2014, enables high-memory nodes to be selected using the bsub -R and -M options, for example:
$ bsub -R "select[maxmem > 128000]"
This will select from machines with greater than 128000 MB physical RAM (units are always in MB) but this doesn't guarantee how much memory is allocated to that job. To target a host with enough free memory, try adding the resource usage:
$ bsub -R "select[maxmem > 128000] rusage[mem=150000]"
A job that selects high-memory hosts with memory equal to or greater than 64 GB should be submitted to the high-mem queue or the par-single queue.
$ bsub -R "select[maxmem > 128000] rusage[mem=150000]" -q high-mem
12 such high-memory (512GB) hosts are currently available.
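The select and rusage parts of the resource string are easy to mistype when written by hand. The sketch below composes them from two numbers and prints the resulting command as a dry run. highmem_resource is a hypothetical helper name, and both arguments are in MB.

```shell
# Compose a -R string that selects hosts by physical RAM and reserves memory.
#   $1: minimum physical RAM of the host, in MB (select[maxmem > ...])
#   $2: memory to reserve for the job, in MB (rusage[mem=...])
highmem_resource() {
    echo "select[maxmem > $1] rusage[mem=$2]"
}

# Dry run using the figures from the example above.
echo "bsub -q high-mem -R \"$(highmem_resource 128000 150000)\" < myjob"
```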
Exclusive host use
The -x option to the bsub command puts the host running your job into exclusive execution mode, so that the host is not shared with other jobs. This is recommended only for very large memory jobs or parallel MPI jobs.
$ bsub -x < myscript
Spanning multiple hosts for additional memory per process:
The span resource string restricts the number of processes run per host. For example, to run only one process per host, use:
$ bsub -R "span[ptile=1]" < myscript
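When ptile is combined with a core count, the number of hosts the job spreads over is the core count divided by ptile, rounded up. The sketch below does that arithmetic. hosts_needed is a hypothetical helper, not an LSF command.

```shell
# Number of hosts needed for a given core count and span[ptile] value:
# the core count divided by ptile, rounded up.
hosts_needed() {
    cores=$1
    ptile=$2
    echo $(( (cores + ptile - 1) / ptile ))
}

hosts_needed 4 1   # prints 4: one process on each of four hosts
hosts_needed 8 3   # prints 3: at most three processes per host
```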
Number of cores
LSF can allocate more than one core to a job and automatically keeps track of the job status while a parallel job is running. When submitting a parallel job that requires multiple cores, you can specify the exact number of cores to use.
To submit a parallel job, use -n <number of cores> to specify the number of cores/processors the job requires. For example:
$ bsub -n 4 myjob
The job "myjob" is submitted as a parallel job and starts when four cores are available.
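Options can also be embedded in the job script as #BSUB lines, which LSF reads when the script is submitted with bsub < myjob. The script below is a hypothetical sketch. The -o and -e options and the LSB_DJOB_NUMPROC variable are standard LSF features not covered in this article, so check the bsub manual page before relying on them. Outside LSF the script falls back to one core, so it can be tried locally.

```shell
#!/bin/bash
# Hypothetical job script; submit with: bsub < myjob
# Request four cores.
#BSUB -n 4
# Runtime limit of 30 minutes.
#BSUB -W 00:30
# Standard output and error files; %J expands to the job ID.
#BSUB -o %J.out
#BSUB -e %J.err

# LSB_DJOB_NUMPROC is assumed to hold the number of allocated cores;
# default to 1 when the script runs outside LSF.
ncores="${LSB_DJOB_NUMPROC:-1}"
echo "Running with $ncores core(s)"
```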
Shared scratch space for temporary job files
/work/scratch/no-mpiio (size 250 TB) is the largest temporary area. It is on new flash-based storage, which should bring significant performance benefits, particularly for operations involving many small files. Please create a subdirectory:
$ mkdir /work/scratch/no-mpiio/newuser
The /work/scratch directory (size 64 TB) is a temporary filespace shared across the whole LOTUS cluster, so that parallel and MPI-IO jobs can access the same files over the course of their execution. This directory uses the Panasas high-speed parallel file system. Please create a subdirectory:
$ mkdir /work/scratch/newuser
Note: you should configure your software to use /work/scratch ONLY if you think you need shared file writes with MPI-IO.
In contrast, the /tmp directories are all local directories, one per host. These can be used to store small temporary data files for fast access by the local process. Please make sure that your jobs delete any files in /tmp when they complete. Note also that large volumes of data cannot be stored on the local /tmp disk. Please use the /work/scratch directory or group workspaces for large data volumes, but be sure to remove data as soon as possible afterwards.
Data in these directories is temporary and may be arbitrarily removed at any point once your job has finished running. Do not use them to store important output for any significant length of time. Any important data should be written to a group workspace so that you do not lose it.
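One way to make sure temporary files are removed is to give each job its own directory and delete it when the script exits. The following is a minimal sketch of that pattern. SCRATCH_BASE is a hypothetical variable introduced for this example; on LOTUS it might point at your own subdirectory under one of the scratch areas above. It defaults to /tmp so the sketch runs anywhere.

```shell
# SCRATCH_BASE is a hypothetical variable for this sketch; set it to your
# own scratch subdirectory. Defaults to /tmp when unset.
scratch_base="${SCRATCH_BASE:-/tmp}"

# Create a private directory for this job's temporary files.
jobdir=$(mktemp -d "$scratch_base/job.XXXXXX")

# Remove the directory and its contents when the script exits.
trap 'rm -rf "$jobdir"' EXIT

# The job writes its temporary files inside the private directory.
echo "intermediate result" > "$jobdir/step1.dat"
ls "$jobdir"
```

Anything that must be kept should be copied to a group workspace before the script ends, since the trap deletes the whole directory.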