Cluster-as-a-Service - Slurm
This article describes how to deploy and use a Slurm cluster using JASMIN Cluster-as-a-Service (CaaS).
CaaS Slurm clusters are currently disabled because of a security problem with the images that were being used. We are working on a new system which will provide slurm clusters.
The Slurm Workload Manager is a popular open-source job scheduler. It provides facilities for executing and monitoring workloads across a set of nodes and managing contention for those nodes by maintaining a queue of pending jobs.
Slurm is a powerful scheduling system, and a full discussion of the available commands and options is beyond the discussion of this article - please consult the Slurm documentation. This article focuses on the specifics of how to deploy and access a Slurm cluster in CaaS.
In CaaS, a Slurm cluster consists of a single login node and several worker nodes. The Linux users and groups on the cluster are managed by the Identity Manager for the tenancy, meaning that SSH access to the nodes can be controlled using FreeIPA groups. User home directories are mounted on all nodes using a shared storage cluster. Slurm is configured with a single queue, to which all the compute hosts are added.
The login node can optionally be assigned an external IP, however external IPs are a scarce resource in the JASMIN Cloud - if you want to preserve your external IPs for other clusters, you can use the Identity Manager gateway host as a jump host.
The following variables are available to configure a Slurm cluster:
|Description||Required?||Can be updated?|
|Identity manager||The CaaS Identity Manager that is used to control access to the cluster.||Yes||No|
|Shared storage||The shared storage cluster to use for user home directories.||Yes||No|
|Worker nodes|| The number of worker nodes in the cluster.
This can be scaled up or down after deployment. When scaling down, there is currently no effort made to drain the hosts in order to remove them gracefully - jobs executing on the removed hosts will fail. This may change in the future.
|Login node size||The size to use for the login node.||Yes||No|
|Compute node size|| The size to use for the compute nodes.
|External IP|| The external IP to attach to the login node.
This is optional - if not given, the cluster can still be accessed by using the Identity Manager's gateway host as a jump host for SSH.
Accessing the cluster
The Slurm hosts are configured to use the users and groups from FreeIPA using SSSD. They are also configured to use SSH keys from FreeIPA for SSH authentication (password-based SSH is disabled).
For every Slurm cluster that is deployed, CaaS automatically creates a group in FreeIPA called
<clustername>_users. This group, along with the
admins group, are permitted SSH access to the hosts in the cluster. To permit a user SSH access to a Slurm cluster, they just need to be added to one of these groups (depending on whether you also want them to be an admin on other clusters).
Once they have been added to one of these groups, the Slurm cluster can be accessed via SSH. The following is an example of accessing a Slurm cluster without an external IP using the Identity Manager's gateway as a jump host:
# Add SSH key to the session $ ssh-add /path/to/ssh/key # SSH to the identity manager gateway with agent forwarding enabled $ ssh -A email@example.com # SSH to the Slurm login node [jbloggs@identity-gateway-0 ~]$ ssh 192.168.3.16 # Check that we are in our home directory [jbloggs@slurm-login-0 ~]$ pwd /home/users/jbloggs # Check the Slurm status [jbloggs@slurm-login-0 ~]$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST compute* up 1-00:00:00 3 idle slurm-compute-[0-2] # Run a simple job [jbloggs@slurm-login-0 ~]$ srun -N3 -l /bin/hostname 0: slurm-compute-0.novalocal 1: slurm-compute-1.novalocal 2: slurm-compute-2.novalocal
A more in-depth discussion of the capabilities of Slurm is beyond the scope of this document - please refer to the Slurm documentation.