Orchid GPU cluster
Details of JASMIN's GPU cluster, ORCHID
The JASMIN GPU cluster, ORCHID, is composed of 16 GPU nodes fitted with NVIDIA A100 GPUs.
Before using ORCHID on JASMIN, you will need:
1. The jasmin-login access role: Apply here
2. (Once the jasmin-login request has been approved and completed), the orchid access role: Apply here
The jasmin-login access role ensures that your account is set up with access to the LOTUS batch processing cluster, while the orchid role grants access to the special LOTUS partition used by ORCHID. Holding the orchid role also gives access to the GPU interactive node.

Note: In the supporting info on the orchid request form, please provide details of the software and the workflow that you will use/run on ORCHID.
Testing a job on the JASMIN ORCHID GPU cluster can be carried out in interactive mode by launching a pseudo-shell terminal Slurm job from a JASMIN scientific server, e.g. sci-vm-01:
srun --gres=gpu:1 --partition=orchid --account=orchid --qos=orchid --pty /bin/bash
srun: job 24096593 queued and waiting for resources
srun: job 24096593 has been allocated resources
At this point, your shell prompt will change to that of the GPU node allocated for this interactive session on ORCHID (here gpuhost016). You will have the one GPU you requested allocated in this shell:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| ... | ... | ... |
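If you work in Python, an equivalent quick check is possible with PyTorch (assuming a Python environment with PyTorch is available to you; this page does not imply that one is provided on ORCHID):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

This should report True and the number of GPUs you requested (1 in this example).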
Note that for batch mode, a GPU job is submitted using the Slurm command sbatch:
sbatch --gres=gpu:1 --partition=orchid --account=orchid --qos=orchid gpujobscript.sbatch
or by adding the following preamble to the job script file:
#SBATCH --partition=orchid
#SBATCH --account=orchid
#SBATCH --gres=gpu:1
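Putting these together, a complete gpujobscript.sbatch could look like the minimal sketch below; the job name, time limit, output file names and the nvidia-smi test command are illustrative assumptions rather than ORCHID requirements:

#!/bin/bash
# Illustrative job name and time limit (assumptions); the time stays within the 24-hour maximum
#SBATCH --job-name=gpu-test
#SBATCH --time=01:00:00
#SBATCH --partition=orchid
#SBATCH --account=orchid
#SBATCH --qos=orchid
#SBATCH --gres=gpu:1
#SBATCH -o %j.out
#SBATCH -e %j.err

# Replace this test command with your own GPU workload
nvidia-smi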
Note 1: gpuhost015 and gpuhost016 are the two largest nodes, with 64 CPUs and 8 GPUs each.
Note 2: The CUDA version reported on the GPU nodes is 12.4 (see the nvidia-smi output above).
Note 3: The Slurm batch partition/queue orchid has a maximum runtime of 24 hours and a default runtime of 1 hour. The maximum number of CPU cores per user is limited to 8. If this limit is exceeded, the job is expected to remain in a pending state with the reason QOSGrpCpuLimit.
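To see why a job is pending, you can ask squeue to show the reason column; the exact format string below is only a convenient example, not a required setting:

squeue -u $USER -o "%.18i %.9P %.12j %.8T %.20r"

The final column (REASON) will show QOSGrpCpuLimit when the per-user CPU core limit is the cause.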
There is an interactive GPU node, gpuhost001.jc.rl.ac.uk, with the same spec as the other ORCHID nodes, which you can access via a login server to prototype and test your GPU code prior to running it as a batch job:
ssh -A gpuhost001.jc.rl.ac.uk
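For example, starting from your own machine, the route onto the interactive node might look like the following sketch; the placeholder <jasmin-login-server> stands for whichever JASMIN login server you normally use (an assumption here), and -A forwards your SSH agent so your key is available for the second hop:

ssh -A <username>@<jasmin-login-server>
ssh -A gpuhost001.jc.rl.ac.uk
nvidia-smi

Once on gpuhost001 you can build and test your code interactively before submitting it to the orchid partition with sbatch.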