Batch scheduler (LSF) overview
This article provides information about the LSF batch scheduler. It covers:
- What is a Job Scheduler?
- The Platform LSF Scheduler
- What is
- LOTUS queues
- The typical workflow for LOTUS jobs
What is a Job Scheduler?
A job scheduler, or "batch" scheduler, is a tool that manages how user jobs are queued and run on a set of compute resources. In the case of LOTUS the compute resources are the set of compute nodes that make up the LOTUS hardware. Each user can submit jobs to the scheduler which then decides which jobs to run and where to execute them. The scheduler manages the jobs to ensure that the compute resources are being used efficiently and that users get appropriate access to those resources.
The Platform LSF Scheduler
The Platform Load Sharing Facility (known as LSF) is the job scheduler deployed on JASMIN. It allows users to submit, monitor and control jobs on the LOTUS cluster.
General principles for working with LSF
Before learning how to use LSF, it is worthwhile becoming familiar with the basic principles of scheduler operation in order to get the best use out of the LOTUS cluster. Scheduler software exists simply because the number of jobs that users wish to run on a cluster at any given time usually greatly exceeds the resources available. This means that the scheduler must queue jobs and work out how to run them efficiently.
Several factors are taken into account during scheduling, such as job duration and size, but the basic principle remains the same throughout: every user gets a fair share of the cluster based on the jobs that they have submitted. This leads to a small number of important principles:
- Do not try to second-guess the scheduler! Submit all of your jobs when you want to run them and let the scheduler figure it out for you. You will get a fair share, and if you don't then we need to adjust the scheduler (so get in touch and let us know).
- Give the scheduler as much information as possible. There are a number of optional parameters (see "How to submit jobs"), such as job duration; supplying them gives your jobs an even better chance of running.
- It is very difficult for one user to monopolise the cluster, even if they submit thousands of jobs. The scheduler will still aim to give everyone else a fair share, so long as there are other jobs waiting to be run.
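As an illustration of giving the scheduler more information, the sketch below assembles a `bsub` submission command that states a queue and a run-time limit up front. It is only a sketch: the script name `my_job.sh`, the helper `build_bsub_command` and the default values are hypothetical, and the authoritative list of options is in "How to submit jobs".

```python
import shlex


def build_bsub_command(script, queue="short-serial", wall_time="00:30",
                       stdout="%J.out", stderr="%J.err"):
    """Return a bsub argument list that tells the scheduler what the job needs."""
    return [
        "bsub",
        "-q", queue,      # queue to submit to
        "-W", wall_time,  # run-time limit, given as hh:mm
        "-o", stdout,     # standard output file; LSF replaces %J with the job ID
        "-e", stderr,     # standard error file
        script,
    ]


if __name__ == "__main__":
    # "./my_job.sh" is a hypothetical job script used only for illustration.
    print(shlex.join(build_bsub_command("./my_job.sh")))
```

Printing the command rather than running it makes it easy to check the options before anything is submitted.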
Fair share for all users
Figure 1 explains how the fair share system works.
Figure 1. Fair share scheduling on LOTUS: Three users (left column) have jobs in the queue (middle column) which are waiting to run on the cluster (right column). As the blue user's job finishes (middle row), all three users could potentially use the two job slots that become available. However, the orange and purple users already have jobs running, whereas the blue user does not, and as such it is the blue user's jobs that are run (bottom row).
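The idea in Figure 1 can be mimicked with a toy model in which each free job slot goes to the waiting user who currently has the fewest running jobs. This is a deliberately simplified sketch and not LSF's actual algorithm, which weighs further factors; the job names and running-job counts are assumptions chosen to resemble the figure.

```python
from collections import deque


def pick_next_user(pending, running):
    """Return the waiting user with the fewest running jobs, or None.

    pending: dict mapping user -> deque of waiting job names
    running: dict mapping user -> number of jobs currently running
    Ties are broken alphabetically by user name.
    """
    candidates = [user for user, jobs in pending.items() if jobs]
    if not candidates:
        return None
    return min(candidates, key=lambda user: (running.get(user, 0), user))


def fill_slots(pending, running, free_slots):
    """Start up to free_slots jobs, re-evaluating the shares after each start."""
    started = []
    for _ in range(free_slots):
        user = pick_next_user(pending, running)
        if user is None:
            break  # nothing left in the queue
        job = pending[user].popleft()
        running[user] = running.get(user, 0) + 1
        started.append((user, job))
    return started


if __name__ == "__main__":
    # Assumed state after the blue user's job finishes: two slots are free.
    pending = {
        "blue": deque(["blue-2", "blue-3"]),
        "orange": deque(["orange-3"]),
        "purple": deque(["purple-3"]),
    }
    running = {"blue": 0, "orange": 2, "purple": 2}
    print(fill_slots(pending, running, free_slots=2))
```

With these assumed numbers both free slots go to the blue user, because the blue user still has fewer running jobs than the others after the first slot is filled.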
LOTUS queues
There are five public LSF queues on LOTUS, including high-mem. The default queue is short-serial. The specification of each queue is described in detail in the article "LSF queues on LOTUS".
Queues other than the five public queues should be ignored unless you have been specifically instructed to use them.
~/.lsbatch is a sub-directory under your home directory. This directory contains your batch job working files, such as temporary job script files automatically created by the LSF Batch system, buffered stdout, stderr, etc. This directory is automatically created by sbatchd on the execution host if it does not already exist. The files associated with a job will be cleaned up when the job finishes.
You should never attempt to remove, edit, or rename these files while jobs are running; otherwise, your batch jobs may fail.
The typical workflow for LOTUS jobs
One of the great advantages of using JASMIN is the ability to create batch jobs that run simultaneously on multiple LOTUS nodes. However, users familiar with running interactively on a single machine often take time to adapt to this new way of working. The change involves moving from a "watching your job run" approach to "submitting your job and coming back later".
The typical workflow for setting up and running LOTUS jobs is as follows:
- Log in to one of the scientific analysis servers.
- Install/write/configure your processing code.
- Test your code interactively: run it locally in a single-process test case.
- Create a wrapper script for your code that allows multiple versions to run independently: e.g. running for different dates or processing different spatial regions/variables.
- Submit your jobs via the batch script.
- Monitor your jobs.
- Gather/analyse/review the outputs as required.
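The wrapper and submission steps above can be sketched as a small script that builds one `bsub` command per day, so that each day is processed as an independent job. This is a minimal sketch under stated assumptions: the wrapper `process_day.sh`, the job-name pattern and the date range are hypothetical, and the commands are only printed (a dry run) rather than submitted.

```python
import shlex
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def submission_commands(start, end, script="./process_day.sh"):
    """Build one bsub command per day so that each day runs as its own job."""
    commands = []
    for day in daily_dates(start, end):
        stamp = day.isoformat()
        commands.append([
            "bsub",
            "-J", f"proc-{stamp}",      # job name, which helps when monitoring
            "-o", f"proc-{stamp}.out",  # separate output file for each job
            script, stamp,              # hypothetical wrapper and its date argument
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands. To submit for real, each list could be
    # passed to subprocess.run on a server where bsub is available.
    for cmd in submission_commands(date(2020, 1, 1), date(2020, 1, 3)):
        print(shlex.join(cmd))
```

Once jobs like these have been submitted, they can be monitored with the LSF `bjobs` command.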
Project-specific LOTUS queues
Occasionally a project has a specific requirement for a collection of compute nodes that involves provision of a project-specific queue. If you are working on such a project, your project lead will provide guidance on which queue to use. Please contact us if you are interested in setting up a project-specific queue.