GPU testing
This article provides preliminary details on how to test a job on the LOTUS GPU cluster.
IMPORTANT: Access to the GPU cluster is restricted to users who are members of the lotus_gpu account. Please contact JASMIN support to request access to the account.
GPU cluster
The JASMIN GPU cluster is composed of 3 GPU hosts: two hosts each with two GPUs (2 x GV100GL) and one host with four GPUs (4 x GV100GL). Each GPU has 32 GB of GPU RAM.
The SLURM batch queue is 'lotus_gpu', with a maximum runtime of 168 hours and a default runtime of 1 hour.
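These partition limits can be checked from a JASMIN scientific server with standard SLURM tools, for example (the exact output depends on the SLURM version and the current cluster state):

$ sinfo -p lotus_gpu
$ scontrol show partition lotus_gpu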
How to test a GPU job
Testing a job on the JASMIN GPU cluster can be carried out interactively by launching a pseudo-terminal SLURM job from a JASMIN scientific server:
$ srun --gres=gpu:1 --partition=lotus_gpu --account=lotus_gpu --pty /bin/bash
cpu-bind=MASK - gpuhost592, task 0 0 [140444]: mask 0xfffffffff set
[freddy@gpuhost592 ~]$
The GPU host gpuhost592 has been allocated for this interactive session on LOTUS.
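Once the session starts on the GPU host, a quick sanity check is to confirm that a GPU is visible, for example by listing the devices with nvidia-smi (shown as a sketch; the output format depends on the driver version):

$ nvidia-smi -L

This should show the GV100GL device(s) granted to the job, matching the --gres=gpu:1 request.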
Note that for batch mode, a GPU job is submitted using the SLURM command 'sbatch':
$ sbatch --gres=gpu:1 --partition=lotus_gpu --account=lotus_gpu mygpujob.sbatch
or by adding the SLURM directives #SBATCH --partition=lotus_gpu, #SBATCH --account=lotus_gpu, and #SBATCH --gres=gpu:1 to the preamble of the job script file.
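For example, a minimal job script using these directives might look like the following sketch (the wall-time, output file names, and the nvidia-smi test command are illustrative):

#!/bin/bash
#SBATCH --partition=lotus_gpu
#SBATCH --account=lotus_gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH -o %j.out
#SBATCH -e %j.err

# Report the GPU allocated to this job (illustrative test command)
nvidia-smi

The script is then submitted with 'sbatch mygpujob.sbatch', and the directives in the preamble take the place of the command-line options shown above.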
Software installed on the GPU cluster
The DGX2/Pearl setup has been implemented on all 3 nodes. The software installed on the hosts includes:
- CUDA drivers 10.1 and CUDA libraries 10.0 and 10.1
- cuDNN (the CUDA Deep Neural Network library)
- NVIDIA container runtime (see notes below)
- NGC client (for NGC, NVIDIA's GPU software hub)
- Singularity 3.4.1, which supports NVIDIA/GPU containers (see the example after this list)
- SCL Python 3.6
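As an illustration of the container support, an NVIDIA GPU container can be pulled and run under Singularity with the --nv flag, which makes the host GPU and driver available inside the container (the image name and tag below are illustrative):

$ singularity pull docker://nvcr.io/nvidia/cuda:10.1-base
$ singularity exec --nv cuda_10.1-base.sif nvidia-smi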