How to monitor SLURM jobs

This article explains how to monitor SLURM jobs. It covers:

  • Job information
  • SLURM commands for monitoring jobs 
  • History of jobs
  • Inspection of job output files

Job information

Information on all running and pending batch jobs managed by SLURM can be obtained with the SLURM command squeue. Note that information on completed jobs is only retained for a limited period. Information on jobs that ran in the past is available via sacct. An example of the output of squeue is shown below.

$ squeue 
      JOBID PARTITION     NAME   USER ST       TIME  NODES NODELIST(REASON)
      18957 short-ser     mean   user1  R       0:01      1 host147
      18956 short-ser     calc   user2  R      48:38      1 host146
      18967      test     wrap   user1  R      14:25      1 host146
where the field 'ST' is the job state and 'TIME' is the time used by the job.
A batch job passes through several states in the course of its execution. The typical job states are defined in Table 1.

Table 1: Job states

Code  Job state   Description
PD    Pending     The job is waiting in a queue for allocation of resources
R     Running     The job is currently allocated to a node and is running
CG    Completing  The job is finishing but some processes are still active
CD    Completed   The job has completed successfully
F     Failed      The job failed with a non-zero exit value
TO    Timeout     The job was terminated by SLURM after reaching its runtime limit
S     Suspended   A running job has been stopped with its resources released to other jobs
ST    Stopped     A running job has been stopped with its resources retained
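
The state codes in Table 1 can be tallied directly from squeue output. The following is a minimal sketch, assuming the default column layout shown in the squeue example, where ST is the fifth field. The function name count_states is made up for illustration and is not a SLURM command.

```shell
#!/bin/sh
# count_states: hypothetical helper, not part of SLURM.
# Reads squeue-style output on stdin and prints one "STATE COUNT" line per state code.
# Assumes the default squeue layout, where ST is the fifth whitespace-separated field.
count_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the squeue example above.
count_states <<'EOF'
      JOBID PARTITION     NAME   USER ST       TIME  NODES NODELIST(REASON)
      18957 short-ser     mean   user1  R       0:01      1 host147
      18956 short-ser     calc   user2  R      48:38      1 host146
      18967      test     wrap   user1  R      14:25      1 host146
EOF
# prints: R 3
```

On a cluster the same function would be fed live data, for example with: squeue | count_states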

SLURM commands for monitoring jobs

The most commonly used commands and their options for monitoring batch jobs are listed in Table 2 below:

Table 2: List of important SLURM commands and their options for monitoring jobs

SLURM command            Description
squeue                   Shows information for all jobs running and pending on the cluster
squeue --user=username   Shows running and pending jobs for an individual user
squeue --states=PD       Shows pending jobs (PD state) and the reasons they are pending
squeue --states=all      Shows jobs in all states, not only running and pending ones
scontrol show job JOBID  Shows detailed information about your job (JOBID = job number)
sacct -b                 Shows a brief listing of past jobs
sacct -l -j JOBID        Shows detailed historical information for the past job with ID JOBID
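
The commands in Table 2 can be combined to wait for a job and then report how it ended. The sketch below is one possible approach rather than a recommended site procedure. The function name wait_for_job and the 30-second default interval are assumptions made for illustration.

```shell
#!/bin/sh
# wait_for_job JOBID [INTERVAL]: hypothetical helper, not a SLURM command.
# Polls squeue until the job is no longer active, then prints its sacct record.
wait_for_job() {
    jobid=$1
    interval=${2:-30}   # seconds between polls (assumed default; keep it large to spare the scheduler)

    # -h suppresses the header, so any output means the job is still in an active state.
    while [ -n "$(squeue -h -j "$jobid" --states=PENDING,RUNNING,SUSPENDED,COMPLETING 2>/dev/null)" ]; do
        sleep "$interval"
    done

    # -X restricts the listing to the job allocation itself, omitting individual job steps.
    sacct -X -j "$jobid" --format=JobID,JobName,State,ExitCode
}

# Usage on a cluster:
#   wait_for_job 18973 60
```

Polling is kept deliberately infrequent here, since every squeue call adds load on the scheduler.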

Inspection of job output files

The example below submits a simple job to SLURM and inspects it with scontrol show job. The StdOut and StdErr fields in the output give the paths of the job output files:

$ sbatch -p test --wrap="sleep 2m"
Submitted batch job 18973
$ scontrol show job 18973
JobId=18973 JobName=wrap
   UserId=fchami(26458) GroupId=users(26030) MCS_label=N/A
   Priority=1 Nice=0 Account=jasmin QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-05-20T14:10:28 EligibleTime=2020-05-20T14:10:28
   AccrueTime=2020-05-20T14:10:28
   StartTime=2020-05-20T14:10:32 EndTime=2020-05-20T15:10:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-20T14:10:32
   Partition=test AllocNode:Sid=sci2-test:18286
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=host147
   BatchHost=host147
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=128890M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=128890M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/users/fchami
   StdErr=/home/users/fchami/slurm-18973.out
   StdIn=/dev/null
   StdOut=/home/users/fchami/slurm-18973.out
   Power=
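
The path of the output file can be extracted from the scontrol output so that the file can be inspected while the job runs. This is a minimal sketch, assuming the "StdOut=path" line format shown above. The function name job_stdout_path is hypothetical.

```shell
#!/bin/sh
# job_stdout_path: hypothetical helper, not part of SLURM.
# Reads "scontrol show job" output on stdin and prints the value of the StdOut field.
job_stdout_path() {
    sed -n 's/^[[:space:]]*StdOut=//p'
}

# Sample lines copied from the scontrol output above.
job_stdout_path <<'EOF'
   StdErr=/home/users/fchami/slurm-18973.out
   StdIn=/dev/null
   StdOut=/home/users/fchami/slurm-18973.out
EOF
# prints: /home/users/fchami/slurm-18973.out

# On a cluster, the file could then be followed as the job writes to it:
#   tail -f "$(scontrol show job 18973 | job_stdout_path)"
```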

History of jobs

Information on past jobs is available via sacct, as in the example below:

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
18963              wrap par-single     jasmin          1  COMPLETED      0:0 
18964              wrap short-ser+     jasmin          1  COMPLETED      0:0 
18965              wrap par-single     jasmin          1  COMPLETED      0:0 
18966              wrap short-ser+     jasmin          1  COMPLETED      0:0 
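
When many past jobs are listed, it can help to show only those that did not complete successfully. The sketch below assumes the input comes from sacct run with --parsable2, which separates fields with "|", and with the three fields JobID, State and ExitCode. The function name not_completed is hypothetical.

```shell
#!/bin/sh
# not_completed: hypothetical helper, not part of SLURM.
# Reads "JobID|State|ExitCode" rows on stdin and prints the rows
# whose state is anything other than COMPLETED.
not_completed() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $2, $3 }'
}

# Usage on a cluster (-X omits individual job steps, --noheader drops the title row):
#   sacct -X --noheader --parsable2 --format=JobID,State,ExitCode | not_completed
```

For the four jobs in the sacct example above, all of which are COMPLETED, this filter would print nothing.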
	
