Secondary copy using Elastic Tape

Introduction

Elastic Tape is a system developed for use with JASMIN Group Workspaces (GWSs), enabling the Group Workspace Manager to:

  • Optimise their use of high-performance online disk by moving data to and from cheaper near-line storage
  • Create and manage secondary copies of GWS data

At present, the system is designed only to be used by GWS Managers, rather than individual members of a GWS. It is the responsibility of a GWS Manager to create and manage backups or additional copies of data in a GWS.

The servers used to access Elastic Tape changed in January 2021. Previous users should note that the server to use is now et.jasmin.ac.uk.

Who can use ET?

ET is only for use by the named GWS manager, i.e. the individual responsible for managing the GWS disk space. The high-performance disk space used for a GWS is a valuable commodity, and the role of the GWS Manager involves making best use of the online space. This may mean moving data to tape to free up space online, or taking a copy of online data to make a secondary copy.

No undertaking is provided that the secondary copy will exist beyond the lifetime of the Group Workspace itself, hence it is called a secondary copy and not a backup. It is also NOT long-term archive storage: some data in GWSs may need to be earmarked for longer-term archive storage and wider availability via the CEDA Archive, but this is a separate process for which data management plans, ingest processes and metadata need to be put in place. Please contact the CEDA / JASMIN helpdesk (support@jasmin.ac.uk) if this is the case.

Each GWS has a quota of online disk space (agreed at the time of its creation) and initially the ET quota has been set to the same value. So if you have a 10 TB workspace, you initially have a matching 10 TB quota of ET storage.

How does it work?

Putting data into ET storage involves creating a "batch" of data which is transferred to the ET system. Using either a file list or top-level directory for reference, the system calculates resources needed and creates a batch, identified by a batch ID. This must be retained by the GWS manager as a "ticket" for later retrieval of this batch of data. It is recommended that you assess the data that you plan to transfer so that you have an idea of the overall volume to be transferred before initiating any actual transfer jobs. It is also recommended to test operation with a small set of test data.
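
To gauge what a batch will involve before creating it, standard shell tools can report the volume and file count of a candidate directory (the path below is illustrative):

$ du -sh /group_workspaces/jasmin/myworkspace/mydir
$ find /group_workspaces/jasmin/myworkspace/mydir -type f | wc -l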

Transparently to the user, and asynchronously (so it is not necessary to wait with a terminal window open), the data are transferred first to an online cache and then to tape storage. It is not an instant process: migrating data from online cache to tape can take several hours, even days, depending on factors such as the size of the transfer, contention for the tape system and network conditions. An RSS feed and a web page provide updates on the progress of data transfer for each batch. Data can later be retrieved, or removed from ET storage, via similar tools.

The transfer of data via a batch involves the "registration" of each file in a database so that its existence is recorded.

Command-line tools are provided on a dedicated machine within the JASMIN infrastructure, to which GWS managers will be given access. A GWS manager has access to the Python tools et_put.py, et_get.py, et_rm.py and et_ls.py. Initial documentation for these command-line tools is given below.

What should I do next?

It is recommended to try sending and retrieving some small data transfers (a few GB) at first using the documentation below, but the system has been designed to cope with storing entire GWSs. You will need ssh login access to et.jasmin.ac.uk first; this should already have been arranged for you, but if not, please contact the CEDA helpdesk. Once there, you should be able to see your group workspace and try out the commands on a small set of test data.

System overview

Elastic Tape provides the ability to create "batches" of files which are then sent to the storage system, initially to an online disk cache before being written to near-line tape. Batches can later be retrieved, or removed. An alert system provides the user with the ability to monitor the progress of transfer jobs.

The system comprises:

  • A command-line interface on a client machine
  • A backend system, consisting of
    • I/O servers connected to an online disk cache and database
    • A near-line tape system

Configuration file

As a GWS manager, you will normally be responsible for one or more GWSs. The GWS with which you wish to work using ET needs to be specified either in a configuration file in your home directory, or by specifying the workspace as an option in the command line interface.

Certain default settings are set in a system-wide config file at /etc/et_config.

To set a default workspace, create a small text file in your home directory named .et_config, containing the following, replacing "myworkspace" with the name of your default workspace:

[Batch]
Workspace = myworkspace
				

"myworkspace" should just be the short name of the workspace, not the full path to it.

The workspace specified in any command-line option overrides that specified in the user's ( ~/.et_config) config file, which in turn overrides that specified in the system (/etc/et_config) config file.

Please REMOVE any previous reference to host and port from your individual ~/.et_config file. This setting is now set from the system /etc/et_config file.

Further configuration options are available in the [DIRECTORY] section of the file; see the system-wide file /etc/et_config for examples. The main parameter whose default you may wish to override is:

outputLevel = workspace|batch|file

although this can also be overridden at the command line. See the et_ls.py documentation below for the meaning of these options.
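
As a sketch, a complete ~/.et_config combining both sections might then look like the following (this assumes the [DIRECTORY] section uses the same INI syntax as the system-wide file; check /etc/et_config for authoritative examples):

[Batch]
Workspace = myworkspace

[DIRECTORY]
outputLevel = batch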

User interface

Please note that NOT ALL features of the currently-implemented user interface are described here; however, we recommend that users limit their usage to the features described below.

The user interface consists of the following components:

  • et_put.py Put data onto tape
  • et_get.py Retrieve data from tape
  • et_rm.py Remove data from tape
  • et_ls.py List data holdings on tape
  • Alerts Get information about processes and holdings via web interface

The commands are available on the host et.jasmin.ac.uk. As a GWS manager you should have been granted login access to this machine using your JASMIN account. If accessing the host from outside the RAL network, you will need to go via one of the login gateways login1.jasmin.ac.uk, login2.jasmin.ac.uk, login3.jasmin.ac.uk or login4.jasmin.ac.uk; use the -A option (or equivalent) for agent forwarding in ssh. STFC users may use the STFC VPN to connect to et.jasmin.ac.uk as if it were a local connection.
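
A typical login sequence from outside the RAL network, with agent forwarding, might look like this (replace username with your own JASMIN account):

$ ssh -A username@login1.jasmin.ac.uk
$ ssh et.jasmin.ac.uk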


et_put.py

Put data onto tape.

Synopsis

$ et_put.py [-v] [-l LOGFILE] [-w WORKSPACE] [-c] [-t one-word-tag] [ -f LISTFILE | DIR ]
						

Description

Data files to be stored can be specified either in an input list file (-f) or by specifying the path to the top of a directory tree containing the files to be stored. All symbolic links are ignored (see note below). In both cases, the system will analyse the request and create a batch, identified by a BATCH ID, which can later be used to retrieve that set of files from storage. Although the main "put" operation is asynchronous (and does not require you to maintain a terminal connection for its duration), the initial registration step, which creates the BATCH ID, is synchronous, so you should wait for this step to complete before disconnecting.

All users of Elastic Tape currently share a total throughput capacity of about 25 TB/day, which may increase over time. Please bear this in mind when organising your input batches and your expectations of completion time. Large numbers of small files will degrade performance.
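
Where it suits your data, one common mitigation (a general suggestion, not an ET-specific feature) is to aggregate many small files into tar archives before the put, for example:

$ tar -cf /group_workspaces/jasmin/myworkspace/small_files.tar -C /group_workspaces/jasmin/myworkspace mydir_of_small_files

The resulting archive can then be included in a batch as a single large file.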

Options

-v Verbose output
-l LOGFILE Log file in which to record process output
-f LISTFILE Text file containing ABSOLUTE paths of files to be stored, 1 file per line. NB Files and directory names are case-sensitive. The list should not contain any blank lines or extraneous white space.
-w WORKSPACE Name of the group workspace to use. Overrides default set in config file. Case sensitive.
DIR ABSOLUTE path to top of directory tree containing files to be stored
-c Continue if errors encountered.
-t tag Tag batch with descriptive label meaningful to user. Should be single one-word string. Appears as "Batch name" in ET alert output and "Tag" in et_ls output.

Example usage

Simple case, using a file input.list which contains the paths of all the files to be included in the batch:

$ et_put.py -v -l et_put.log -f input.list -w myworkspace

In the following example, the -c option is used to continue on errors. One error that may be encountered is that a file already exists in the system (e.g. it has already been "put"). This option causes the system to ignore any errors and continue with the transfer. Note that it should not be used by default (we would rather know about errors and fix them!).

$ et_put.py -v -l et_put.log -f input.list -w myworkspace -c
						

Alternative usage specifying a directory beneath which all files / directories will be included. In this case the directory must be the last parameter in the command:

$ et_put.py -v -l et_put.log -w myworkspace /group_workspaces/jasmin/myworkspace/mydir
						

Symbolic links: Attempting to include symbolic links in an et_put operation will cause an error. You can override this with the -c option (although this will ignore ALL errors), but a better solution is to generate a list file as in the first two examples above. If the list file is generated with a command like find <path> -type f > listfile.txt, then referring to it in the et_put command ensures that only regular files are included in the batch. You can then keep the list file (perhaps named after the resulting batch ID) for your own records.
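
Putting this together, a symlink-safe put might look like the following (paths illustrative):

$ find /group_workspaces/jasmin/myworkspace/mydir -type f > input.list
$ et_put.py -v -l et_put.log -f input.list -w myworkspace

Since find -type f matches only regular files, symbolic links are excluded from the list, and the list file itself can be kept as a record of the batch contents.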


et_get.py

Retrieve data from tape

Synopsis

$ et_get.py [-v] [-l LOGFILE] [-b BATCHID | -f FILELIST] [-w WORKSPACE] [-r DIR] [-t MAXPROC]
				

Description

Data files to be retrieved should be specified by referring to the batch ID of the batch in which they were stored. Files are not written back to their original location: instead, the retrieval process recreates each file's full original path beneath the retrieval directory DIR. For example, a file stored as /group_workspaces/jasmin/myworkspace/mydir/file.dat and retrieved with -r set to /group_workspaces/jasmin/myworkspace/ettmp will appear at /group_workspaces/jasmin/myworkspace/ettmp/group_workspaces/jasmin/myworkspace/mydir/file.dat.

Proposed best practice is to create a temporary directory for retrieved data within your workspace, e.g. /group_workspaces/jasmin/myworkspace/ettmp, and to do the initial retrieval into that directory. Once you are satisfied that the retrieval has completed correctly, the data can be moved back to their original location in the workspace. NB if you need additional storage space for this, please contact the CEDA help desk.

Options

-v Verbose output
-l LOGFILE Log file in which to record process output. Note that the log file location must be capable of accepting multi-threaded output; otherwise, omit this parameter and pipe stdout from the et_get command to a log file instead.
-b BATCHID ID of batch to be retrieved
-f FILELIST A list of individual files to be retrieved, with one file per line. Note that:
  • entries in the list must contain the full name of the file, including path, just as it was archived;
  • the list should not contain blank lines or any extraneous white space.
-w WORKSPACE name of the group workspace to use. Overrides default set in config file. Case sensitive.
-r DIR ABSOLUTE path of retrieval location
-t MAXPROC Maximum number of worker processes to use in retrieval. MAXPROC recommended to be between 5 and 10. Please feed back your experience of performance improvement with this feature.

Example usage

$ cd /group_workspaces/jasmin/myworkspace
$ mkdir ettmp
$ et_get.py -v -l et_get.log -w myworkspace -b 507 -r /group_workspaces/jasmin/myworkspace/ettmp
				

At this point, data will be transferred into the specified retrieval directory. Files and directories will be restored with their ABSOLUTE path below the retrieval directory. NB this is a synchronous process and you will need to keep your terminal window open to ensure it completes (or run within the screen command if you are familiar with this).
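
If you are unfamiliar with screen, a minimal pattern is (the session name etget is arbitrary):

$ screen -S etget
$ et_get.py -v -w myworkspace -b 507 -r /group_workspaces/jasmin/myworkspace/ettmp

Detach with Ctrl-a d and reattach later with screen -r etget to check on progress.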

When the retrieval process has finished, you should satisfy yourself that the retrieved data are correct, using your preferred method.
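
A simple sanity check (one suggestion among many) is to compare the file count and total volume of the retrieved data against the figures you recorded when the batch was created:

$ find /group_workspaces/jasmin/myworkspace/ettmp -type f | wc -l
$ du -sh /group_workspaces/jasmin/myworkspace/ettmp

Once you are satisfied, move the data to the required location as shown below: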

$ mv /group_workspaces/jasmin/myworkspace/ettmp/group_workspaces/jasmin/myworkspace/* /group_workspaces/jasmin/myworkspace
				

et_rm.py

Remove data from tape

Synopsis

$ et_rm.py [-v] -b BATCHID [-w WORKSPACE]
			

Description

Deletes the files in the specified batch from the Elastic Tape system.

Options

-v Verbose output
-b BATCHID ID of batch to be removed
-w WORKSPACE name of the group workspace to use. Overrides default set in config file. Case sensitive.
Example usage:

$ et_rm.py -v -b 507
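
Removal is destructive, so it is prudent to confirm the contents of the batch first using et_ls.py (documented below), for example:

$ et_ls.py -w myworkspace -H -L file -b 507
$ et_rm.py -v -b 507 -w myworkspace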

et_ls.py

List holdings on tape

Synopsis

$ et_ls.py [-h] [-X XMLSOURCE] [-H] [-b BATCHID] [-w WORKSPACE] [-L {file,batch,workspace}] [-F {text}]
		

Description

Lists the holdings of a workspace within Elastic Tape at the file, batch or workspace level.

Options

-h, --help Show this help message and exit
-X XMLSOURCE, --xmlsource XMLSOURCE Base XML source, if not the default. Note that this currently has to be compatible with the base source, so it cannot be pointed at files, for example
-H, --headerWanted Print headers showing column names for text output
-b BATCHID, --batchid BATCHID ID of batch by which to filter results
-w WORKSPACE Name of the group workspace to use. Overrides default set in config file. Case sensitive.
-L {file,batch,workspace}, --outputLevel {file,batch,workspace} Level of detail to display for results (default is "workspace")
-F {text}, --outputFormat {text} Format to use for the display of results

Example usage:

$ et_ls.py -w myworkspace -H -L file -b 504
		

Works with the workspace "myworkspace", selects display of headers in the output, and shows results at file level, filtered by batch ID 504 (i.e. the files present in ET in the given batch).

$ et_ls.py -w myworkspace -H -L batch
		

Works with the workspace "myworkspace", selects display of headers in the output, and shows results at batch level (i.e. the batches present in the ET holdings for this workspace).
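
Similarly, omitting -L (or passing -L workspace, the default) gives a workspace-level summary of your ET holdings:

$ et_ls.py -w myworkspace -H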


Alerts

The system provides real-time status messages on the progress of requested operations, via an RSS feed and a web page. These services are now available only inside the RAL firewall, so JASMIN users outside RAL should use the NX graphical desktop service to open a Firefox browser on one of the nx-login servers to access them.

Both views can be customised to display only alerts from the workspace of interest to the GWS manager, by replacing WORKSPACE with your workspace name in the URL (as in the examples below).

Further views

ET Home http://et-monitor.fds.rl.ac.uk/et_user/ET_Home.php?caller=USERNAME

Holdings summary http://et-monitor.fds.rl.ac.uk/et_user/ET_Holdings_Summary.php?caller=USERNAME&workspace=WORKSPACE

(replace USERNAME with your username, WORKSPACE with your workspace name in the above URLs)
