How to monitor Slurm jobs
Getting information about your Slurm jobs
Information on all running and pending batch jobs managed by Slurm can be
obtained with the Slurm command `squeue`. Note that information on completed
jobs is only retained for a limited period. Information on jobs that ran in
the past is available via `sacct`. A (simplified) example of `squeue` output
is shown below.
squeue -u fred
JOBID PARTITION QOS NAME USER ST TIME NODES NODELIST(REASON)
18957 standard standard mean fred R 0:01 1 host147
18967 debug standard wrap fred R 14:25 1 host146
The `ST` field is the job state and the `TIME` field is the time used by the
job. You may also see `TIME_LEFT`, `CPUS` (number of CPUs for the job), `PRIORITY`,
and `NODELIST(REASON)`, which shows the hosts a running job is using or, for a job that is not running, the reason it is in its current state.
The `-u fred` argument restricts the `squeue` output to jobs belonging to user `fred`. Alternatively,
use `squeue --me`, which means “my own jobs”.
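The additional columns mentioned above can be requested explicitly with the `--format` option of `squeue`. The following is a minimal sketch: the function name `my_jobs` and the column widths are illustrative choices, not something defined by Slurm or by this page.

```shell
# Hypothetical helper: print one snapshot of the current user's jobs
# with extra columns.
# squeue format codes used here:
#   %i job ID      %P partition   %j job name   %t state (short code)
#   %M time used   %L time left   %C CPUs       %Q priority (integer)
#   %R node list for running jobs, or the reason for pending jobs
my_jobs() {
    squeue --me --format="%.10i %.12P %.12j %.2t %.10M %.10L %.5C %.10Q %R"
}

# Usage: my_jobs
```

Calling `my_jobs` prints the listing once and returns, so it is suitable for an occasional manual check rather than repeated automatic polling.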
Full details are given in the official Slurm documentation for the `squeue` command.
Please DO NOT use `watch` or equivalent polling utilities with Slurm,
as they are wasteful of resources and cause communication issues for the scheduler.
Your process may be killed if this is detected.
A batch job passes through several states during its execution. The typical job states are defined below:
Symbol | Job state | Description |
---|---|---|
PD | Pending | The job is waiting in a queue for allocation of resources |
R | Running | The job is currently allocated to a node and is running |
CG | Completing | The job is finishing but some processes are still active |
CD | Completed | The job has completed successfully |
F | Failed | Failed with non-zero exit value |
TO | Timeout | Job terminated by Slurm after reaching its runtime limit |
S | Suspended | A running job has been stopped with its resources released to other jobs |
ST | Stopped | A running job has been stopped with its resources retained |
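To see how many of your jobs are in each of these states, the state codes printed by `squeue` can be tallied. This is a small sketch, assuming a hypothetical helper named `count_states` that reads one state code per line on standard input:

```shell
# Hypothetical helper: count how often each state code occurs.
# Input: one state code per line (blank lines are ignored).
# Output: "CODE COUNT" pairs, sorted by code.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (run once, not in a polling loop):
#   squeue --me --noheader --format=%t | count_states
```

Here `--noheader` suppresses the column header and `--format=%t` prints only the short state code for each job.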
The most commonly used commands and options for monitoring batch jobs are listed below:
Slurm Command | Description |
---|---|
`squeue` | Displays information for all jobs running and pending on the cluster |
`squeue --user=username` | Displays running and pending jobs for an individual user |
`squeue --me` | Displays running and pending jobs for the current user |
`squeue --states=PD` | Displays information for pending jobs (PD state) and their reasons |
`squeue --states=all` | Displays jobs in all states, including those that have recently finished |
`scontrol show job JOBID` | Shows detailed information about your job (JOBID = job number) as currently held by the scheduler |
`sacct -b` | Shows a brief listing of past jobs |
`sacct -l -j JOBID` | Shows detailed historical information for the past job with the given JOBID |
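The long listing from `sacct -l` is very wide. As a possible alternative, `sacct` accepts a `--format` option to select individual fields. The sketch below wraps this in a hypothetical function, `job_history`; the choice of fields is only an illustration.

```shell
# Hypothetical helper: show selected accounting fields for one past job.
# $1 = job ID
job_history() {
    sacct -j "$1" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
}

# Usage: job_history 18973
```

`Elapsed` is the wall-clock time the job ran for and `MaxRSS` is the peak memory use recorded for a job step, where the site collects it.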
An example of the job record from a simple job submitted to Slurm:
sbatch -A mygws -q debug -p debug --wrap="sleep 2m"
Submitted batch job 18973
The job ID reported by `sbatch` can then be used in the next command:
scontrol show job 18973
JobId=18973 JobName=wrap
UserId=fred(26458) GroupId=users(26030) MCS_label=N/A
Priority=1 Nice=0 Account=jasmin QOS=standard
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2020-05-20T14:10:28 EligibleTime=2020-05-20T14:10:28
AccrueTime=2020-05-20T14:10:28
StartTime=2020-05-20T14:10:32 EndTime=2020-05-20T15:10:32 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-20T14:10:32
Partition=test AllocNode:Sid=sci2-test:18286
ReqNodeList=(null) ExcNodeList=(null)
NodeList=host147
BatchHost=host147
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=128890M,node=1,billing=1
Socks/Node=*NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=128890M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/users/fred
StdErr=/home/users/fred/slurm-18973.out
StdIn=/dev/null
StdOut=/home/users/fred/slurm-18973.out
Power=
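The job record above is a series of `Key=Value` pairs separated by spaces, so a single field can be picked out with standard text tools. This is a rough sketch only: `job_field` is a hypothetical helper name, and it assumes the value itself contains no spaces (which may not hold for fields such as `Command`).

```shell
# Hypothetical helper: print the value of one Key=Value field.
# $1 = field name; stdin = output of "scontrol show job JOBID"
job_field() {
    # Put each space-separated token on its own line, then print
    # everything after the first "=" of the first matching key.
    tr ' ' '\n' |
        awk -F= -v key="$1" '$1 == key && !found { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Usage:
#   scontrol show job 18973 | job_field JobState
#   scontrol show job 18973 | job_field TimeLimit
```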
Jobs that ran in the past can be listed with `sacct`:
sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
18963 wrap par-single jasmin 1 COMPLETED 0:0
18964 wrap short-ser+ jasmin 1 COMPLETED 0:0
18965 wrap par-single jasmin 1 COMPLETED 0:0
18966 wrap short-ser+ jasmin 1 COMPLETED 0:0
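A listing like this can also be filtered for jobs that did not finish cleanly. The sketch below assumes a hypothetical helper, `problem_jobs`, that reads `sacct` output produced with `--parsable2` (fields separated by `|`) and `--noheader`, with the fields in the order JobID, State, ExitCode.

```shell
# Hypothetical helper: list jobs whose state is neither COMPLETED,
# RUNNING nor PENDING (for example FAILED or TIMEOUT).
# Input lines look like: JobID|State|ExitCode
problem_jobs() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" { print $1, $2, $3 }'
}

# Usage (-X restricts the listing to whole jobs, omitting job steps):
#   sacct -X --parsable2 --noheader --format=JobID,State,ExitCode | problem_jobs
```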