Using the JASMIN Object Store

This article describes how to use the JASMIN high-performance object storage.

What is object storage?

An object store is a data storage system that manages data as objects referenced by a globally unique identifier, with attached metadata. This is a fundamental change from traditional file systems that you may be used to, as there is no directory hierarchy - the objects exist in a single flat domain. These semantics allow the object store to scale out much more easily than a traditional shared file system.
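
To make the flat namespace concrete, the following minimal sketch models an object store as a single dictionary keyed by object name. It is an illustration only and says nothing about how the JASMIN object store is implemented; the class and method names (ToyObjectStore, put, get, list_prefix) and the example keys are hypothetical.

```python
class ToyObjectStore:
    """In-memory illustration of a flat object namespace (not a real client)."""

    def __init__(self):
        # One flat mapping: object key -> (data, metadata). No directories exist.
        self._objects = {}

    def put(self, key, data, **metadata):
        # Store the data together with its attached metadata under one key.
        self._objects[key] = (data, dict(metadata))

    def get(self, key):
        return self._objects[key]

    def list_prefix(self, prefix):
        # A "directory listing" is just a filter on key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("model-run/2020/temp.nc", b"...", variable="temperature")
store.put("model-run/2021/temp.nc", b"...", variable="temperature")
store.put("readme.txt", b"hello")

listing = store.list_prefix("model-run/")
```

The slashes in the keys carry no special meaning to the store; tools that show "folders" are simply grouping keys that share a prefix.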

The other fundamental change is that the data is no longer accessed by mounting a file system onto a host and referencing a file path (where authentication amounts to "can I log in to the host?"). Instead, the data is accessed over HTTP, with authentication using HTTP headers. This has many benefits, the biggest of which is that we can make the object store available outside of the JASMIN firewall, for example to the JASMIN External Cloud. Data can be read and written in the same way, using the same tools, from inside and outside JASMIN. Contrast this with Group Workspaces, where you must be logged in to a JASMIN host in order to write data using the file system, and data is only accessible externally in a read-only way using HTTP or OPeNDAP.

Object stores are widely seen as the most efficient (and cheapest!) way to store and access data from the cloud, and all the major cloud providers support some variant of object store. The JASMIN object store is S3 compatible - S3 is the object store for Amazon Web Services (AWS), and has become a de facto standard interface for object stores. This means that tools that work with AWS S3 will also work with the JASMIN object store.

Accessing the object store

Before you can access the object store, you must request a tenancy from us. To do this, please contact the CEDA helpdesk.

Your object store tenancy will have a name, which will usually have a -o suffix - for the rest of this article, we will use my-os-tenancy-o. This is used to identify your tenancy in URLs.

To join an object store tenancy, navigate to the "Services" section in the JASMIN Accounts Portal and select the "Object store" category. Select a tenancy and submit a request to join. This request will then be considered by the service manager and either accepted or rejected.

URLs for internal and external access

Although the data is exactly the same in both cases, a slightly different URL must be used depending on whether you are accessing the object store from inside or outside JASMIN.

From inside JASMIN, including LOTUS and the Scientific Analysis servers, my-os-tenancy-o.s3.jc.rl.ac.uk should be used.

From outside JASMIN, including from the External Cloud, my-os-tenancy-o.s3-ext.jc.rl.ac.uk should be used - note the additional -ext.
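
As a small sketch of the rule above, the helper below builds the hostname for a tenancy depending on where the code is running. The function name object_store_host and its inside_jasmin flag are hypothetical; only the two hostname patterns come from this article, and the URL scheme is deliberately left out.

```python
def object_store_host(tenancy, inside_jasmin):
    """Return the object store hostname for a tenancy.

    inside_jasmin=True  -> <tenancy>.s3.jc.rl.ac.uk
    inside_jasmin=False -> <tenancy>.s3-ext.jc.rl.ac.uk
    """
    service = "s3" if inside_jasmin else "s3-ext"
    return f"{tenancy}.{service}.jc.rl.ac.uk"


internal_host = object_store_host("my-os-tenancy-o", inside_jasmin=True)
external_host = object_store_host("my-os-tenancy-o", inside_jasmin=False)
```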

Creating an access key and secret

Authentication with the object store uses an access key and secret that are separate to your JASMIN username and password. You can generate an access key and secret using the Caringo portal. This portal is not currently available outside of JASMIN - you will need to use a graphical session on JASMIN to access a Firefox browser running on a JASMIN system. Currently, this can only be done using X11 Forwarding on your SSH connection:

$ ssh -AY <user>@jasmin-sciX.ceda.ac.uk firefox

Once you have Firefox open, navigate to http://my-os-tenancy-o.s3.jc.rl.ac.uk/_admin/portal. You will see a login screen where you should enter your JASMIN username and password.

Upon successfully entering the username and password of a user who belongs to the tenancy, you will see a dashboard. To create an access key and secret, click on the cog icon and select "Tokens".

On the tokens page, click "Add".

In the dialogue that pops up, enter a description for the token and set an expiration date. Make sure to click "S3 Secret Key" - this will expose an additional field containing the secret key. Make sure you copy this and store it somewhere safe - you will not be able to see it again! This value will be used whenever the "S3 secret key" is required.

Once the token is created, it will appear in the list. The "Token" should be used whenever the "S3 access key" is required.
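
One possible way to keep the token and secret out of your scripts is to save them in a JSON file that only you can read. The sketch below assumes the file layout used by the Python example later in this article (keys 'token', 'secret' and 'endpoint_url'); the function names save_credentials and load_credentials are hypothetical, and the values shown are placeholders.

```python
import json
import os
import stat
import tempfile


def save_credentials(path, token, secret, endpoint_url):
    """Write the credentials as JSON, readable and writable by the owner only."""
    credentials = {"token": token, "secret": secret, "endpoint_url": endpoint_url}
    # Mode 0600 applies when the file is newly created; an existing file keeps
    # whatever permissions it already has.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(path, flags, stat.S_IRUSR | stat.S_IWUSR)
    with os.fdopen(fd, "w") as f:
        json.dump(credentials, f, indent=2)


def load_credentials(path):
    """Read the credentials back as a dictionary."""
    with open(path) as f:
        return json.load(f)


# Usage example with placeholder values, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "jasmin_object_store_credentials.json")
    save_credentials(example_path, "EXAMPLE-TOKEN", "EXAMPLE-SECRET", "ENDPOINT-URL")
    loaded = load_credentials(example_path)
```

In real use the file would live somewhere persistent, such as your home directory, rather than in a temporary directory.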

Accessing data in the object store

Using the MinIO client

The MinIO Client is a command line tool for connecting to object stores (among other types of file storage) and interacting with them as you would with a UNIX filesystem. As such, it provides many of the standard UNIX file management commands (ls, cat, cp and rm, for example).

There are a number of ways to install this client, as shown in the quickstart guide. Methods include Docker, Homebrew for macOS, wget for Linux and instructions for Windows. Follow the steps for the relevant system to install the client. On Linux, the client can be moved to /usr/bin/ (which normally requires administrator privileges) so that it can be executed from anywhere in the filesystem. Assuming the client is in your current directory:

mv ./mc /usr/bin/

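To configure the client for the JASMIN object store, create an access key and secret as documented above and insert them into the command below. Note that recent releases of the client replace "mc config host add" with "mc alias set", which takes the same arguments: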

mc config host add [ALIAS] [S3-ENDPOINT] [TOKEN] [S3 SECRET KEY]

The ALIAS is the name you will use to refer to the object store when using the client. For example, if the alias was set to "jasmin-store", the contents of a specific bucket in the object store would be listed as follows:

mc ls jasmin-store/my-bucket

The commands available in the client are documented in the quickstart guide (linked above). Copying an object from one place to another is very similar to a UNIX filesystem:

mc cp jasmin-store/my-bucket/object-1 jasmin-store/different-bucket/

From Python

One method of accessing the object store from Python is using s3fs. This library builds on botocore but abstracts a lot of the complexities away. There are three main types of object in this library: S3FileSystem, S3File and S3Map. The filesystem object is used to configure a connection to the object store, and can perform many of the operations that can be done with the MinIO client. Note: it's strongly recommended to store the endpoint, token and secret outside of the Python file, either using environment variables or an external file:

import json

import s3fs

# Load the endpoint, token and secret from a file kept outside the script
with open('jasmin_object_store_credentials.json') as f:
    jasmin_store_credentials = json.load(f)

jasmin_s3 = s3fs.S3FileSystem(anon=False, secret=jasmin_store_credentials['secret'],
                               key=jasmin_store_credentials['token'],
                               client_kwargs={'endpoint_url': jasmin_store_credentials['endpoint_url']})

# Size of the object in bytes
my_object_size = jasmin_s3.du('my-bucket/object-1')
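
As a further sketch of the filesystem-style operations, the helper below collects the size of every object under a bucket or prefix, roughly what a recursive listing with the MinIO client would show. The function name object_sizes is hypothetical; it relies only on the find() method that s3fs inherits from fsspec, and expects a filesystem object such as jasmin_s3 to be passed in.

```python
def object_sizes(fs, prefix):
    """Return {object path: size in bytes} for every object under prefix."""
    # find() walks everything below the prefix; detail=True returns a dict
    # mapping each object path to its info dictionary.
    details = fs.find(prefix, detail=True)
    return {path: info["size"] for path, info in details.items()}
```

For example, object_sizes(jasmin_s3, 'my-bucket') would return one entry per object in my-bucket.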

S3File is used for dealing with individual files on the object store within Python. These objects can be read from and written to the store:

# Open an existing object for reading
file_object = s3fs.S3File(jasmin_s3, 'my-bucket/object-1', mode='rb')
# refresh can be set to True to disable metadata caching
file_metadata = file_object.metadata(refresh=False)
file_object.close()

# Writing requires a handle opened in write mode; data must be a bytes object
file_object = s3fs.S3File(jasmin_s3, 'my-bucket/object-1', mode='wb')
file_object.write(data)
# Writes are buffered: s3fs flushes automatically once the buffer reaches the
# block size, and the upload is only completed when the file is closed
file_object.flush()
file_object.close()
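
In practice it is often simpler to let the filesystem object open the file inside a with block, so that the file is closed (and any pending upload completed) automatically. The following is a minimal sketch under that assumption; the helper names write_bytes and read_bytes are hypothetical, and fs stands for a filesystem object such as jasmin_s3.

```python
def write_bytes(fs, path, data):
    """Write data (bytes) to an object; the file is closed when the block exits."""
    with fs.open(path, "wb") as f:
        f.write(data)


def read_bytes(fs, path):
    """Read the full content of an object as bytes."""
    with fs.open(path, "rb") as f:
        return f.read()
```

For example, write_bytes(jasmin_s3, 'my-bucket/object-1', b'hello') followed by read_bytes(jasmin_s3, 'my-bucket/object-1') should return the same bytes.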

S3Map is very useful when using xarray to open a number of data files (netCDF4 for example) and convert them to the zarr format, ready to be stored as objects on the store. The xarray to_zarr method can write a .zarr store to a POSIX filesystem, or stream it directly to an object store. These datasets can then be opened back into Python:

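import xarray

# filepath_list is a list of paths to the source netCDF files
dataset = xarray.open_mfdataset(filepath_list, engine='netcdf4')
s3_store = s3fs.S3Map('my-bucket/zarr-data', s3=jasmin_s3)
# consolidated=True writes the consolidated metadata that open_zarr reads below
dataset.to_zarr(store=s3_store, mode='w', consolidated=True)

# Reopening the dataset from object store using xarray
dataset = xarray.open_zarr(s3_store, consolidated=True)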
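
Because an S3Map behaves like a Python mapping from key to bytes, it can be inspected with ordinary dictionary operations. As a rough sketch, the check below tests whether a store contains consolidated metadata before it is opened with consolidated=True. It assumes the Zarr version 2 layout, where consolidated metadata is stored under the key .zmetadata; other Zarr format versions store it differently. The function name has_consolidated_metadata is hypothetical.

```python
def has_consolidated_metadata(store):
    """True if the mapping holds a Zarr v2 consolidated metadata object."""
    # Works with any mapping-like store, including an S3Map or a plain dict.
    return ".zmetadata" in store
```

For example, has_consolidated_metadata(s3_store) could be called before xarray.open_zarr to decide whether to pass consolidated=True.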