How to submit jobs

This article explains how to submit batch jobs to LOTUS. It covers:

  • What is a batch job?
  • Job submission methods
  • Important job submission options
  • Job array submission
  • Recursive job submission
  • Job dependencies submission

What is a batch job?

High performance parallel computing codes generally run in "batch" mode. Batch jobs are controlled by scripts written by the user and submitted to a batch system that manages the compute resource and schedules the job to run based on a set of policies. From now on we will use the term "job" to refer to a "batch job".

Job submission methods

Jobs are submitted using the bsub command. A job can be submitted on the command line or via a job script file.

1. Command-line

In its simplest form, a job can be submitted to the default queue using a command such as:

$ bsub -o %J.out -W 00:10  /bin/hostname

This will run the specified command and write standard output and error to a file with the job number (%J) in the filename. If no output file is specified, the scheduler will attempt to e-mail the output to the addresses listed in your $HOME/.forward file. The -W 00:10 option specifies that the predicted maximum wall time of the job (the time it takes to actually run) is 10 minutes. The importance of this argument is explained below.

There are many additional options that can be used to control the resources allocated to an individual job, including the queue to submit to, the wall time limit (-W option), and any processor or memory allocation/limitations for your job.
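
As a rough sketch of how several of these options combine on the command line, the snippet below assembles a single submission. The script name ./my_analysis.sh, the job name my_job and the 30 minute limit are placeholders, and the DRY_RUN switch is an addition made here for illustration so that the command can be previewed without submitting anything.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical helper switch: with the default of 1 the
# bsub command is only printed; set DRY_RUN=0 to really submit.
DRY_RUN="${DRY_RUN:-1}"

submit() {
    if [ "$DRY_RUN" = "1" ]; then
        # Preview only. Note that echo drops any quoting of the arguments.
        echo "bsub $*"
    else
        bsub "$@"
    fi
}

# Queue, job name, wall time limit, and separate output and error files.
submit -q short-serial -J my_job -W 00:30 -o %J.out -e %J.err ./my_analysis.sh
```

Wrapping bsub in a small function like this is only one possible design; it keeps the option list in one place and makes it easy to check the command before it reaches the scheduler.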

2. Script file

For more complex jobs, such as those running a sequence of commands or requiring more complicated environment configuration, you can put everything together in a single Bash script file, e.g. demo.bsub:

#BSUB -q short-serial 
#BSUB -o %J.out 
#BSUB -e %J.err 
#BSUB -W 00:10

echo "Running hostname"
sleep 2m

This Bash script contains a series of directives, prefixed with `#BSUB`, which provide the same functionality as the command-line arguments above. In this example, the LOTUS short-serial queue has been specified, and the standard output and standard error of the job are written to separate files.

To submit a job using the Bash script file it is important to use a redirect such as:

$ bsub < demo.bsub

where demo.bsub is a file containing the text above. Omitting the redirect (i.e. the character <) will result in some or all of the directives in demo.bsub being ignored by LSF.

Note: Jobs can also be submitted to LOTUS using the Rose-cylc workflow manager, which is installed on the jasmin-cylc server.

Important job submission options

There are many options available for the bsub command, some of which should be included in every job, if possible. These are summarised below (Table 1):

Table 1. Job submission options. Recommended options are underlined.

  bsub option              Description
  -q <queue-name>          Submit the job to the specified queue. By default this is "short-serial".
  -W HH:MM                 Request a wall time limit of HH hours and MM minutes. By default this is one hour.
  -o <file_name>           Write output to <file_name>. "%J" can be used to include the job ID, e.g. "%J.out".
  -oo <file_name>          As `-o`, but overwrite the output file if it exists.
  -e <file_name>           Write errors to <file_name>. "%J" can be used to include the job ID, e.g. "%J.err".
  -eo <file_name>          As `-e`, but overwrite the error file if it exists.
  -R "rusage[mem=XXX]"     Request a specific amount of memory for your job, where XXX is the memory size in MB. For jobs using more than 4 GB of memory, please include this option (see <page on estimating job resources>).
  -M XXX                   Set the memory limit to XXX, where XXX is the memory size in MB. LSF kills the job if this limit is exceeded. For jobs using more than 4 GB of memory, please include this option.
  -x                       Require exclusive access to the host on which the job runs. This may be required in certain cases, but will usually result in a longer queuing time.
  -J <job_name>            Assign the specified name "<job_name>" to the job.
  -n 16                    Request a number of CPU cores (16 in this case).
  -J "<arr_name>[1-10]"    Name and create the job array "<arr_name>" with an index list (see the job array section below).
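
To illustrate how the memory options in the table sit alongside the others, here is a minimal sketch that writes a job script requesting memory with -R and setting a matching limit with -M. The 8000 MB figure, the one hour wall time, the file name mem_job.bsub and the program ./my_analysis.sh are all assumptions chosen for the example.

```shell
#!/bin/sh
MEM_MB=8000   # assumed memory need in MB; replace with your own estimate

# Write the job script. The unquoted EOF lets ${MEM_MB} expand now,
# while %J is left for LSF to fill in at run time.
cat > mem_job.bsub <<EOF
#BSUB -q short-serial
#BSUB -J mem_job
#BSUB -oo mem_job-%J.out
#BSUB -eo mem_job-%J.err
#BSUB -W 01:00
#BSUB -R "rusage[mem=${MEM_MB}]"
#BSUB -M ${MEM_MB}

./my_analysis.sh
EOF

# Show the generated script. On LOTUS it would be submitted with:
#   bsub < mem_job.bsub
cat mem_job.bsub
```

Deriving both values from one variable is a convenience rather than a requirement; it simply keeps the request and the limit from drifting apart.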

Don't forget to set your job time limit!

The default value for -W (predicted wall time) is 1 hour. If you do not specify this option, the default applies, and any job that exceeds its time limit will be terminated by the scheduler. The maximum run time allowed per job is specific to each queue (see Table 1 LOTUS queues). Any job that runs for longer than this maximum will be terminated automatically, even if a longer duration was requested. See establishing job duration for more information.
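
If you have a run time estimate in minutes, a small helper can turn it into the HH:MM form that -W expects. The sketch below is illustrative only: the function name to_walltime, the 75 minute estimate and the 20% safety margin are assumptions, not site recommendations.

```shell
#!/bin/sh
# to_walltime is a hypothetical helper: convert whole minutes to HH:MM.
to_walltime() {
    printf '%02d:%02d\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

ESTIMATE_MIN=75                                     # assumed run time estimate
LIMIT_MIN=$(( $ESTIMATE_MIN + $ESTIMATE_MIN / 5 ))  # add an assumed 20% margin

# Print the directive to paste into a job script.
echo "#BSUB -W $(to_walltime "$LIMIT_MIN")"
```

Remember that the resulting value still has to fit within the maximum run time of the queue you submit to.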

Job array submission 

Job arrays are groups of jobs with the same executable and resource requirements, but different input files. Job arrays can be submitted, controlled, and monitored as a single unit or as individual jobs or groups of jobs. Each job submitted from a job array shares the same job ID as the job array and is uniquely referenced using an array index. This approach is useful for 'high throughput' tasks, for example where you want to run your simulation with different driving data or run the same processing task on multiple data files.

Important note: The maximum job array size that LSF is configured for is MAX_JOB_ARRAY_SIZE = 10000. If a job array with more than 10000 elements is submitted, LSF will reject the job submission with the following error message: "Job array index too large. Job not submitted."

Taking a simple R submission script as an example:

#BSUB -q short-serial
#BSUB -J R_job
#BSUB -oo R-%J.o
#BSUB -eo R-%J.e
#BSUB -W 8:00
Rscript TestRFile.R dataset1.csv

If you wanted to run the same script TestRFile.R with input file dataset2.csv through to dataset10.csv, you could create and submit a job script for each dataset.  However, by setting up an array job, you could create and submit a single script. 

The corresponding job array script for the above example would look something like:

#BSUB -q short-serial
#BSUB -J R_job[1-10]
#BSUB -oo R-%J-%I.o
#BSUB -eo R-%J-%I.e 
#BSUB -W 8:00
Rscript TestRFile.R dataset${LSB_JOBINDEX}.csv

Here the important differences are:

  • The array is created in the job name directive by including elements numbered [1-10] to represent our 10 variations.
  • The error and output file names include the index %I.
  • The environment variable $LSB_JOBINDEX in the Rscript command is expanded to give the job index.

When the job is submitted, LSF will create 10 tasks under the single job ID. The job is submitted in the usual way:

$ bsub < Rarray.bsub

If you use the bjobs command to list your active jobs, you will see 10 tasks with the same job ID. The tasks can be distinguished by the [index] shown in the JOB_NAME column. Note that individual tasks may be allocated to a range of different hosts on LOTUS.
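
The array example above relies on the input files being numbered consecutively. When they are not, one common workaround is to keep a list of file names and let each array element pick the line that matches its index. The sketch below assumes a hypothetical list file inputs.txt and made-up file names; in practice the list would be prepared before submission rather than inside the job.

```shell
#!/bin/sh
# inputs.txt is a hypothetical list with one input file name per line.
# It is created here only so that the sketch is self-contained.
printf '%s\n' london.csv paris.csv rome.csv > inputs.txt

# input_for_index is a hypothetical helper: print line N of inputs.txt.
input_for_index() {
    sed -n "${1}p" inputs.txt
}

# LSF sets LSB_JOBINDEX for each array element; default to 1 so the
# lookup can also be tried outside the batch system.
LSB_JOBINDEX="${LSB_JOBINDEX:-1}"
INPUT_FILE=$(input_for_index "$LSB_JOBINDEX")

echo "Element ${LSB_JOBINDEX} would run: Rscript TestRFile.R ${INPUT_FILE}"
```

With a three-line list like this one, the job name directive would use an index range of [1-3] so that every element has a matching line.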

Recursive job submission

If you plan to use recursive job submission (i.e. submit jobs which submit other jobs), plan this with care and test away from LOTUS first to make sure any loop has an exit.
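
One way to guarantee an exit is to carry a counter through the chain of submissions and stop once it reaches a cap. The following is a minimal sketch of that idea. The variable names DEPTH and MAX_DEPTH, the cap of 5 and the script name this_script.bsub are all hypothetical, and the sketch assumes the submission environment is passed on to the job. It only prints what it would do, so it can be tested away from LOTUS.

```shell
#!/bin/sh
# DEPTH counts how many times the chain has resubmitted itself so far.
DEPTH="${DEPTH:-0}"
MAX_DEPTH="${MAX_DEPTH:-5}"

should_resubmit() {
    # Succeeds only while the chain is shorter than the cap.
    [ "$1" -lt "$2" ]
}

if should_resubmit "$DEPTH" "$MAX_DEPTH"; then
    NEXT=$(( DEPTH + 1 ))
    echo "Would resubmit with DEPTH=${NEXT}"
    # On LOTUS this might become something like:
    #   DEPTH=$NEXT bsub < this_script.bsub
else
    echo "Reached MAX_DEPTH=${MAX_DEPTH}; not resubmitting"
fi
```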

Job dependencies submission 

Within the LSF batch system, you can schedule a job to run dependent on other LSF jobs. When you submit a job, use bsub -w 'dependency_expression' to specify a dependency expression, usually based on the job states of preceding jobs. LSF will not place your job unless this dependency expression evaluates to TRUE. If you specify a dependency on a job that LSF cannot find (such as a job that has not yet been submitted), your job submission fails.

A typical use of job dependencies is to start one job when another finishes. First, submit the first job and note its job ID:

$ bsub < job1.bsub
$ bjobs
5861536   fchami  RUN   par-single jasmin-sci1 16*host290. my-job1 Nov 16 16:51

Then submit the second job with a dependency on the first, so that it starts only once job 5861536 has completed successfully:

$ bsub -w 'done(5861536)' < job2.bsub
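
When chaining jobs from a script, the job ID can be taken from the confirmation message that bsub prints instead of being copied by hand. The sketch below assumes the usual LSF message form "Job <ID> is submitted to queue <name>." and uses a hard-coded sample message so that it runs anywhere. The helper name job_id_from is hypothetical.

```shell
#!/bin/sh
# job_id_from is a hypothetical helper: extract the number between the
# first pair of angle brackets in bsub's confirmation message.
job_id_from() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Sample confirmation line. On LOTUS this would come from:
#   MSG=$(bsub < job1.bsub)
MSG='Job <5861536> is submitted to queue <par-single>.'
JOB1=$(job_id_from "$MSG")

# Print the dependent submission rather than running it.
echo "bsub -w 'done(${JOB1})' < job2.bsub"
```

If the message does not match the assumed form, JOB1 is left empty, so a real script should check for that before submitting the dependent job.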