JASMIN Notebooks Service with GPUs enabled
The JASMIN Notebook Service has recently been updated to include a GPU-enabled node. This means that JASMIN users can now run Machine Learning (ML) workflows in Notebooks. This page outlines how to start a GPU-enabled Notebook server, install ML packages into your own environments, and check that the GPUs are available.
The service is available to all JASMIN users who have been granted access to the ORCHID (GPU) cluster. Existing JASMIN users can apply here.
In order to start a Notebook Server with GPUs enabled, go to the initial start page and click on the “Launch Server” button:
Then select the “GPU” option and click “Start”:
Check the top-right corner of a Notebook session to see which kernel is being used.
If you don’t need any specialist Machine Learning (ML) libraries, you would typically choose Python 3 + Jaspy, as this has many of the common open-source packages used within environmental science.
You can click on the name of the kernel to select a different one.
If you want to work with GPUs, you are likely to want to install other packages that are common in ML, such as PyTorch and TensorFlow. This topic is discussed below.
In order to check that your notebook is running on a server with GPUs, you can use the built-in NVIDIA commands, such as:
!nvidia-smi
If GPUs are enabled, the output should look like this:
[Screenshot: nvidia-smi command output]
The first section includes:

- The GPU model: NVIDIA A100-SXM4-40GB. Each GPU has 40GB of on-board memory.
- Memory usage is shown as N/A because these GPUs are in MIG mode (Multi-Instance GPU), so memory usage is not reported here in the usual way. Memory usage for MIG slices is shown in the dedicated MIG section (below).
- GPU utilisation is shown as N/A for the same reason (MIG is active, so usage must be looked at per MIG instance).

The second section introduces MIG (Multi-Instance GPU). When a GPU is running in MIG mode, it can be partitioned into multiple instances, each acting as a smaller independent, or virtual, GPU. Because MIG is turned on, you see “N/A” in the normal memory usage fields. Instead, there is a dedicated table for each MIG device:

- Each MIG instance has its own allocation (9984MiB) of GPU memory. Currently, only 13 MiB is being used, likely overhead.

The third section, processes, indicates what is running on the GPU/MIG instances:
In short: There are two physical A100 GPUs. Each is in MIG mode and is presenting one virtual GPU instance with 10GB of memory. Currently, neither GPU has any running processes, so they’re essentially idle. The top-level memory usage fields are “N/A” because MIG splits the GPU resources, and the usage is shown in the MIG devices table below.
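If you want to confirm from Python how much memory your MIG instance actually exposes, one option (a minimal sketch, assuming you have installed PyTorch into a virtual environment as described later on this page) is torch.cuda.mem_get_info(), which reports the free and total memory for the visible device:

import torch

# Report free and total memory (in bytes) for the currently visible device,
# which on this service is a single MIG instance.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"Free: {free / 1024**2:.0f} MiB, Total: {total / 1024**2:.0f} MiB")
else:
    print("No GPU visible to PyTorch")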
The following command will give you the exact IDs of the available GPUs and MIG instances:
!nvidia-smi -L
The output will be something like:
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2927d07e-3fe9-7904-9e08-b08b82d9a37d)
MIG 1g.10gb Device 0: (UUID: MIG-6e95ef19-5145-571b-b040-7e731f1c1af3)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-e109d8d9-923e-7235-0429-96b7fdbcbd30)
MIG 1g.10gb Device 0: (UUID: MIG-b4bcd4f3-6f69-516d-9404-b5ada80d760b)
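These UUIDs can be used to target a specific MIG instance. One way to do this (a sketch only; the UUID below is just the example from the output above, so substitute one reported on your own server) is to set the CUDA_VISIBLE_DEVICES environment variable before your ML library is first imported:

import os

# Restrict this notebook to a single MIG instance, identified by its UUID.
# Set this before torch (or tensorflow) is first imported in the session,
# otherwise the restriction will not be picked up.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-6e95ef19-5145-571b-b040-7e731f1c1af3"

import torch
print(torch.cuda.device_count())  # expected to report 1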
The current allocation of GPUs to the JASMIN Notebook Service is as follows: two NVIDIA A100-SXM4-40GB GPUs, each configured in MIG mode to present a virtual GPU with 10GB of memory.

In the current release of the Notebook Service, users are required to install their own ML packages for use with GPUs. We recommend this approach (a sketch of the commands is given after the list):
1. Create a Python virtual environment (venv), for example called ml-venv. Use our guide to help you.
2. Install the packages you need into the venv. For example, to install pytorch and torchvision, you would run pip install torch torchvision (including specific versions if needed). NOTE: Many ML packages are very big - this can take several minutes.
3. Install ipykernel into your venv and run the relevant command to install the kernel so that JupyterHub can locate it and list it as one of the available kernels. Use the name of your venv as the name of the kernel.
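As a rough sketch of these steps, run in a terminal (the venv name ml-venv and its location under your home directory are just examples; adapt the package list to your needs and see our guide for the recommended details):

$ python -m venv ~/ml-venv
$ source ~/ml-venv/bin/activate
$ pip install torch torchvision        # may take several minutes
$ pip install ipykernel
$ python -m ipykernel install --user --name ml-venv

After refreshing the kernel list (or restarting your notebook server), the new kernel should appear under the name of your venv.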
It is common to find that different workflows will require different versions of software packages. In the fast-moving world of ML, the libraries and their dependencies often change, and this can cause problems when trying to work within a single software environment.
If you encounter this kind of problem, we recommend that you create multiple
virtual environments and their associated kernels. You can then select the
appropriate kernel for each notebook. It may also be worth investing the time
in capturing exact versions of the relevant packages so that you can reproduce
your environment if necessary. Python packages often use a requirements file (typically named requirements.txt) to capture the dependencies. For example:
scikit-learn==1.5.1
torch==2.5.1+cu124
torchvision==0.20.1+cu124
All packages listed in a requirements file can be installed with a single command:
$ pip install -r requirements.txt
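To capture the exact versions currently installed in a venv (as suggested above), one convenient, if somewhat blunt, option is pip freeze:

$ pip freeze > requirements.txt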
CUDA is the system that connects Python libraries to the GPUs (on NVIDIA hardware).
When we install PyTorch, or many other ML packages, they should automatically detect CUDA
if it is available. Assuming that you have followed the instructions to create a venv
and install PyTorch, you can check for CUDA with:
>>> import torch
>>> print("Is CUDA available? ", torch.cuda.is_available())
Is CUDA available? True
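Once CUDA is available, a common pattern (just a sketch) is to select the device explicitly and move tensors or models onto it:

import torch

# Use the GPU (the MIG instance) if it is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move a small tensor onto the selected device as a quick check
x = torch.randn(3, 3).to(device)
print(x.device)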
The same thing is possible with TensorFlow:
>>> import tensorflow as tf
>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
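As a further check that computation is actually placed on the GPU (again, just a sketch, assuming TensorFlow is installed in your venv), you can enable device-placement logging and run a small operation:

import tensorflow as tf

# Log which device each operation is placed on
tf.debugging.set_log_device_placement(True)

# A small matrix multiplication; the logs (and the tensor's device attribute)
# should show it running on a GPU if one is visible
a = tf.random.normal((1000, 1000))
b = tf.random.normal((1000, 1000))
c = tf.matmul(a, b)
print(c.device)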
Please be aware that installing these packages into your $HOME
directory will
require multiple gigabytes of free space. If you are near your quota (100GB),
then the installation may fail. It is important to note that an
installation failure may not report a violation of disk quota even if
that is the underlying problem.
See the HOME directory documentation for details on checking your current disk usage.
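As a quick check from within a notebook, you could, for example, report the size of a virtual environment directory (ml-venv here is just the example name used above):

!du -sh ~/ml-venv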
Please make use of GPUs efficiently in your code. If you only need CPUs, then please use the standard Notebook service. One way to ensure that the resource is being efficiently used is to stop your notebook server, via the Hub Control Panel (see the File menu) when not actively needed. Be sure to save your notebook before stopping the server.
The per-user memory limit for a given notebook is shown in the bar below (typically 16GB). On the GPU architecture there is 10GiB per virtual GPU.
Experienced JASMIN users will be familiar with the resource limitations of the Notebook Service. Whilst it is great for prototyping, scientific notebooks and code-sharing, it does not suit large multi-process and long-running workflows. The LOTUS cluster is provided for larger workflows, and it includes the ORCHID partition for GPU usage.
We recommend that you use the GPU-enabled Notebook Service to develop and prototype your ML workflows, and migrate them to ORCHID if they require significantly more compute power. Please contact the JASMIN Helpdesk if you would like advice on how to migrate your workflows.
For advice on machine learning packages and suitability for different applications, you could make use of the NERC Earth Observation Data Analysis and AI service (NEODAAS). See the NEODAAS website for details.
An introductory notebook, which includes most of the information provided on this page, is available on GitHub. It may provide a useful starting point for your journey.