GPU testing

This article provides preliminary details on how to test a job on the LOTUS GPU cluster.

GPU cluster

The JASMIN GPU cluster is composed of 3 GPU hosts: two hosts with two GPUs each (2 x GV100GL) and one host with four GPUs (4 x GV100GL). Each GPU has 32 GB of RAM.

The SLURM batch queue is 'lotus_gpu', with a maximum runtime of 168 hours and a default runtime of 1 hour.

How to test a GPU job

Testing a job on the JASMIN GPU cluster can be carried out interactively by launching a pseudo-terminal SLURM job from one of the JASMIN scientific servers:

$ srun --partition=lotus_gpu --account=lotus_gpu --pty /bin/bash
cpu-bind=MASK - gpuhost592, task  0  0 [140444]: mask 0xfffffffff set 
[freddy@gpuhost592 ~]

The GPU host gpuhost592 is allocated.
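Once the interactive session starts on the allocated GPU host, it is worth confirming that the GPUs are actually visible before running any workload. This is a quick sanity check, assuming `nvidia-smi` is on the path as part of the CUDA driver installation listed below:

```shell
# List the GPUs visible on this host, with model name and total memory.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```

On one of the two-GPU hosts this should report two GV100GL entries; on the four-GPU host, four.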

Note that in batch mode a GPU job is submitted using the SLURM command 'sbatch':

sbatch --partition=lotus_gpu --account=lotus_gpu mygpujob.sbatch

or by adding the SLURM directives #SBATCH --partition=lotus_gpu and #SBATCH --account=lotus_gpu to the preamble of the job script file.
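The two directives can be combined in a minimal job script. The sketch below is illustrative only: the `--gres=gpu:1` request, job time limit, and final command are assumptions using generic SLURM syntax, not site-mandated values:

```shell
#!/bin/bash
#SBATCH --partition=lotus_gpu    # GPU batch queue
#SBATCH --account=lotus_gpu      # GPU account
#SBATCH --gres=gpu:1             # request one GPU (generic SLURM syntax, assumed)
#SBATCH --time=01:00:00          # stay within the 1-hour default runtime

# Report the allocated GPU, then run your application in its place.
nvidia-smi
```

Saved as mygpujob.sbatch, this script can be submitted with the 'sbatch' command shown above without repeating the partition and account options on the command line.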

Software Installed on the GPU cluster

The DGX2/Pearl software setup has been implemented on all 3 nodes. Each host includes:

- CUDA drivers 10.1, and CUDA libraries 10.0 and 10.1

- CUDA DNN (Deep Neural Network Library)

- NVIDIA container runtime (see notes below)

- NGC client (GPU software hub for NVIDIA)

- Singularity 3.4.1 - which supports NVIDIA/GPU containers

- SCL Python 3.6
