How to control jobs
This article shows how to control jobs that have been submitted to LOTUS. It covers the following:
- How to modify job options (Wall time, resource reservation)
- Suspend and resume a job
- Move a job to the top or bottom of a queue
- Move a job between queues
- Kill a job
How to modify job options
Job submission parameters can be modified with the bmod command. Which parameters can be changed depends on the state of the job: both pending and running jobs can be modified, but a running job accepts fewer changes.
Modify a pending job
If your submitted jobs are pending (bjobs shows the job in the "PEND" state), you can modify the job submission parameters. You can also modify entire job arrays or individual elements of a job array.
- To replace the job command line, run bmod -Z "new_command". For example: bmod -Z "myjob file" 101
- To change a specific job parameter, run bmod with the corresponding option, for example -b for the start time. The specified options replace the submitted options.
The following example changes the start time of job 101 to 2:00 a.m.:
$ bmod -b 2:00 101
- To reset an option to its default submitted value (undo a bmod), append the n character to the option name and do not include an option value.
The following example resets the start time for job 101 back to its default value:
$ bmod -bn 101
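Because these options apply to pending jobs, it can help to check the job state before calling bmod. The following is a minimal sketch, not an LSF feature: job_state is a hypothetical helper that reads bjobs-style text on standard input and assumes the default column order (JOBID, USER, STAT, ...). The listing is made-up sample data so that the snippet runs without LSF.

```shell
# Hypothetical helper: print the STAT column for a given job ID.
# Reads bjobs-style output on stdin; assumes the default column layout
# JOBID USER STAT QUEUE ... (an assumption, check your own bjobs output).
job_state() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Made-up listing standing in for the output of bjobs.
sample_bjobs='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
101 someuser PEND par-single jasmin-sci1 myjob Oct 31 14:28
102 someuser RUN par-single jasmin-sci1 host042 myjob Oct 31 14:20'

state=$(printf '%s\n' "$sample_bjobs" | job_state 101)
if [ "$state" = "PEND" ]; then
    echo "job 101 is pending; bmod -b 2:00 101 can be applied"
else
    echo "job 101 is in state ${state:-unknown}; only running-job options apply"
fi
```

With LSF available, the same function could be fed from the real bjobs output instead of the sample text.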
Modify a running job
If your submitted job is running (bjobs shows the job in the "RUN" state), you can modify only some of the job options, including resource reservation, wall time and memory limit. You must be the job owner or an LSF administrator to modify a running job.
Modify the resource reservation
A job is usually submitted with a resource reservation for the maximum amount required. Run bmod -R to modify the resource reservation of a running job, typically to decrease it.
For example, to modify the resource reservation for job 101 to 20GB of memory:
$ bmod -R "rusage[mem=20000]" 101
Modify job options
The bmod options for modifying the run-time limit, memory limit, and standard error and output files of a running job are listed below. Each option either takes a new value or, with the n suffix, is reset:
- Run limit: -We <HH:MM> <job_id> | -Wen
- Memory limit: -M <mem_limit> <job_id> | -Mn
- Standard error file name: -e <error_file> <job_id> | -en
- Standard output file name: -o <output_file> <job_id> | -on
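As an illustration, the option syntax above could be expanded into full command lines as follows. This is a dry-run sketch: the hypothetical show helper only prints each command rather than running it, and the job ID, limit value and file names are placeholders chosen for the example. The units accepted by -M depend on how LSF is configured at your site.

```shell
# Dry-run helper (hypothetical): print a command instead of executing it.
show() {
    printf '+ %s\n' "$*"
}

job_id=101   # placeholder job ID

show bmod -M 8000 "$job_id"          # set a new memory limit (site-configured units)
show bmod -e job_101.err "$job_id"   # write standard error to a different file
show bmod -o job_101.out "$job_id"   # write standard output to a different file
show bmod -Mn "$job_id"              # reset the memory limit (n suffix, no value)
```

Removing the show prefix would run the commands for real, which only makes sense on a host where LSF is available and for a job you own.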
Suspend and resume a job
You can suspend and resume a job using the bstop and bresume commands. A job can be suspended by its owner or the LSF administrator with the bstop command. These jobs are considered user-suspended and are displayed by bjobs as "USUSP".
When the user resumes the job with the bresume command, the job is not restarted immediately, to prevent overloading. Instead, the job is changed from "USUSP" to "SSUSP" (suspended by the system). The "SSUSP" job is resumed when host load levels are within the scheduling thresholds for that job, similarly to jobs suspended due to high load.
For example, to stop and then resume job 6678, enter the following:
$ bstop 6678
Job <6678> is being stopped
$ bresume 6678
Job <6678> is being resumed
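The states mentioned in this article can be summarised in a small lookup. The sketch below is only an illustration: describe_state is a hypothetical helper, not an LSF command, and it covers just the four states named in this article.

```shell
# Hypothetical helper: explain a bjobs STAT value in plain words.
# Covers only the states discussed in this article.
describe_state() {
    case "$1" in
        PEND)  echo "pending" ;;
        RUN)   echo "running" ;;
        USUSP) echo "suspended by the owner or an LSF administrator (bstop)" ;;
        SSUSP) echo "suspended by the system until host load allows a resume" ;;
        *)     echo "state not covered here: $1" ;;
    esac
}

describe_state USUSP
describe_state SSUSP
```

A job that still shows SSUSP shortly after bresume is therefore behaving as expected and is waiting for the host load to drop.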
Move a job to the bottom/top of a queue
Use the LSF command bbot to move jobs relative to your last job in the queue, and btop to move jobs relative to your first job in the queue. You must be an LSF administrator or the user who submitted the job. By default, LSF dispatches jobs in a queue in order of arrival (that is, first-come-first-served), subject to availability of suitable compute hosts. Please consult the LSF Documentation for more information.
Move a job between queues
A pending job can be moved from its current queue to a different queue with the bmod -q <queue-name> <jobID> command, where <queue-name> is the queue to which the job is moved. The following example moves job 4105225 from the par-multi queue to the par-single queue:
$ bjobs
4105225 fchami PEND par-multi jasmin-sci1 multinodes Oct 31 14:28
$ bmod -q par-single 4105225
Parameters of job <4105225> are being changed
$ bjobs
4105225 fchami PEND par-single jasmin-sci1 multinodes Oct 31 14:28
Note: Resources that the job requests must be within the resource allocation limits of the selected queue.
Kill a job
You can cancel a running or pending job by killing it. The bkill command causes LSF to send the SIGINT and SIGTERM signals to the job to give it a chance to clean up, and then the SIGKILL signal to kill it. The following example kills the job with jobID 3421:
$ bkill 3421
Job <3421> is being killed
Note: If you use the bjobs command immediately after the bkill command on a running job, it will often show the job as still being in the RUN state. This is normal. There is no need to issue another bkill command; doing so will not kill the job any faster. It can take several minutes for a bkill command to end a large parallel job.
To kill all of your pending jobs, you can use the following combination of LSF and Linux commands, where <username> is your username:
$ bkill `bjobs -u <username> |grep PEND |cut -f1 -d" "`
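Note that grep matches PEND anywhere on a line, so the pipeline above would also select a job whose name happens to contain that string. A possible alternative is to compare the STAT column exactly with awk. The sketch below assumes the default bjobs column order (JOBID, USER, STAT, ...); pending_job_ids is a hypothetical helper, and the listing is made-up sample data so that the snippet runs without LSF.

```shell
# Hypothetical helper: print the IDs of jobs whose STAT column is exactly PEND.
# Reads bjobs-style output on stdin; assumes JOBID is column 1 and STAT column 3.
pending_job_ids() {
    awk '$3 == "PEND" { print $1 }'
}

# Made-up listing; the running job's name contains "PEND" on purpose.
sample_bjobs='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
3421 someuser RUN par-single jasmin-sci1 host042 APPEND_step Oct 31 14:20
3422 someuser PEND par-single jasmin-sci1 myjob Oct 31 14:28
3423 someuser PEND par-multi jasmin-sci1 multinodes Oct 31 14:29'

# Prints only the IDs of the two pending jobs, one per line.
printf '%s\n' "$sample_bjobs" | pending_job_ids
```

With LSF available, the helper could take the place of the grep and cut stages, reading from bjobs -u <username> and passing the resulting IDs to bkill.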
If a job cannot be killed in the operating system, you can force the removal of the job from LSF. The bkill -r command sends the same series of signals as bkill, but removes the job from LSF immediately, without waiting for it to terminate in the operating system.