Example Job 2: Calculating MD5 Checksums on many files

 


Sample workflows for LOTUS


This page records some early CEDA usage of the LOTUS cluster for various relatively simple tasks. Others may wish to use these examples as a starting point for developing their own workflows on LOTUS.

Case 1: Calculating MD5 Checksums on many files  

This is a simple case because:

  1. the code only needs to read the archive, and
  2. the code we need to run uses only basic Linux commands, so there are no dependencies to pick up from elsewhere.

Case Description  

  • We want to calculate the MD5 checksums of about 220,000 files. Running them all in series would take a day or two.
  • We have a text file that contains 220,000 lines, one file path per line.

Solution under LOTUS  

  • Split the 220,000 lines into 22 files of 10,000 lines.
  • Write a template script to:
    • Read a text file full of file paths
    • Run the md5sum command on each file and log the result.
  • Write a script to create 22 new scripts (based on the template script), each of which takes one of the input files and works through it.
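
The chunking arithmetic behind this plan can be sketched in Python. This is only an illustration, not part of the original workflow: the function name split_into_chunks and the generated paths are hypothetical, and the chunk names imitate what `split -d` produces for a small number of chunks such as the 22 here.

```python
def split_into_chunks(lines, chunk_size=10000):
    """Group lines into chunks named like the output of `split -d` (x00, x01, ...)."""
    chunks = {}
    for start in range(0, len(lines), chunk_size):
        name = "x%02d" % (start // chunk_size)
        chunks[name] = lines[start:start + chunk_size]
    return chunks


# Hypothetical stand-in for the 220,000 paths in file_list.txt
paths = ["/path/to/file_%06d" % n for n in range(220000)]
chunks = split_into_chunks(paths)
print(len(chunks), min(chunks), max(chunks))  # prints: 22 x00 x21
```

Each chunk then becomes the input of one LOTUS job, so the number of jobs is the line count divided by the chunk size, rounded up.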

Workflow steps  

Log in to a sci server (any of sci-vm-0[1-6], accessed via a login server):

ssh -A <username>@sci-vm-01.jasmin.ac.uk

Split the big file:

split -l 10000 -d file_list.txt    # Produces 22 files called "x00"..."x21"

Create the template file: scan_files_template.sh

#!/bin/bash
#SBATCH -A mygws
#SBATCH -p standard 
#SBATCH -q standard
#SBATCH -e %J.e

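# Input list for this job; the placeholder below is replaced with a chunk name such as x00
infile=/home/users/astephen/sst_cci/to_scan/__INSERT_FILE__

# Checksum each listed file, appending the result to this chunk's log
while IFS= read -r f ; do
    /usr/bin/md5sum "$f" >> /home/users/astephen/sst_cci/output/scanned___INSERT_FILE__.log
done < "$infile"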

Run a script to generate all the script files:

for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
    cp scan_files_template.sh bin/scan_files_${i}.sh
    perl -p -i -w -e 's/__INSERT_FILE__/'${i}'/g;' bin/scan_files_${i}.sh 
done

Submit all 22 jobs to LOTUS:

for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do      
    echo $i    
    cat /home/users/astephen/sst_cci/bin/scan_files_${i}.sh | sbatch -o /home/users/astephen/sst_cci/output/$i   
done

Monitor the jobs by running:

squeue -u <username>

All jobs ran within about an hour.
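
Once the jobs have finished, it may be worth checking that every listed file produced a checksum. The sketch below is one possible check rather than part of the original workflow: it assumes that md5sum writes exactly one line to the log per file it reads successfully (failures go to the %J.e error file instead), and the function names count_lines and find_incomplete are hypothetical.

```python
import os


def count_lines(path):
    """Return the number of lines in a file."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def find_incomplete(to_scan_dir, output_dir):
    """Return chunk names whose log has fewer lines than the input list."""
    incomplete = []
    for chunk in sorted(os.listdir(to_scan_dir)):
        # Log name follows the template script: scanned_<chunk>.log
        log = os.path.join(output_dir, "scanned_%s.log" % chunk)
        expected = count_lines(os.path.join(to_scan_dir, chunk))
        found = count_lines(log) if os.path.exists(log) else 0
        if found < expected:
            incomplete.append(chunk)
    return incomplete
```

Calling find_incomplete with the to_scan and output directories used above would return an empty list if every chunk was fully processed. Any chunk it does list can be investigated through its error file and resubmitted.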

Case 2: Checksumming CMIP5 Data  

A variation on Case 1 has been used for checksumming datasets in the CMIP5 archive. The Python code below will find all NetCDF files in a DRS dataset and generate a checksums file and error log. Each dataset is submitted as a separate Slurm job.

""" 
Checksum a CMIP5 dataset
usage: checksum_dataset.py dataset_id ...
    where dataset_id is a full drs id including version 
    e.g. cmip5.output1.MOHC.HadGEM2-ES.historical.6hr.atmos.6hrLev.r1i1p1.v20110921
"""
import os
import os.path as op
import sys
import optparse

DRS_ROOT = '/badc/cmip5/data'

def submit_job(dataset):
    # Assume version is in the dataset-id for now
    # Each dot-separated DRS component is one directory level under DRS_ROOT
    parts = dataset.split('.')
    path = op.join(DRS_ROOT, '/'.join(parts))

    if not op.exists(path):
        raise Exception('%s does not exist' % path)
    job_name = dataset
    # md5sum writes its checksums to stdout, which Slurm captures in the -o file
    cmd = ("sbatch -A mygws -p standard -q standard -J {job_name} "
            "-o {job_name}.checksums -e {job_name}.err "
            "--wrap \"srun /usr/bin/md5sum {path}/*/*.nc\""
        ).format(job_name=job_name, path=path)

    # Echo the command so the submission can be checked, then run it
    print(cmd)
    os.system(cmd)

def main():
    parser = optparse.OptionParser(description='Checksum DRS datasets')
    (options, args) = parser.parse_args()

    datasets = args
    for dataset in datasets:
        submit_job(dataset)

if __name__ == '__main__':
    main()
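
As a possible variation on submit_job in checksum_dataset.py, the sbatch command can be built as an argument list and run with subprocess instead of os.system, which avoids nested shell quoting. This is a sketch, not the method used by CEDA: the helper name build_sbatch_command is hypothetical, the account and partition values are copied from the script above, and the existence check on the dataset path is omitted for brevity.

```python
import os.path as op
import subprocess

DRS_ROOT = '/badc/cmip5/data'


def build_sbatch_command(dataset, drs_root=DRS_ROOT):
    """Return the sbatch command for one dataset as an argument list."""
    path = op.join(drs_root, *dataset.split('.'))
    return [
        "sbatch", "-A", "mygws", "-p", "standard", "-q", "standard",
        "-J", dataset,
        "-o", dataset + ".checksums",
        "-e", dataset + ".err",
        # The glob is not expanded here; the shell that runs the
        # wrapped command inside the job expands it
        "--wrap", "srun /usr/bin/md5sum %s/*/*.nc" % path,
    ]


def submit_job(dataset):
    """Submit one dataset; raises CalledProcessError if sbatch fails."""
    subprocess.run(build_sbatch_command(dataset), check=True)
```

Keeping the command construction in its own function means it can be inspected or tested without submitting anything to Slurm.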

If you have a file containing a list of dataset ids, you can submit each as a separate job by invoking checksum_dataset.py as follows. The script prints each sbatch command before submitting it, followed by Slurm's confirmation:

./checksum_dataset.py $(cat datasets_to_checksum.dat)
sbatch -A mygws -p standard -q standard -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err --wrap "srun /usr/bin/md5sum /badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc"
Submitted batch job 40898728
...
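
Each job leaves a .checksums file in md5sum's usual output format: a 32-character hex digest, a space, a mode character, then the file path. A minimal sketch for loading such a file into a dictionary follows. The function name parse_checksums is hypothetical, and the sketch ignores md5sum's backslash-escaped form for unusual file names.

```python
import re

# md5sum prints a 32-character hex digest, a space, a mode character
# (space for text, * for binary), then the file name
LINE_PATTERN = re.compile(r"^([0-9a-f]{32}) [ *](.+)$")


def parse_checksums(text):
    """Map each file path in md5sum output to its checksum."""
    checksums = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            digest, path = match.groups()
            checksums[path] = digest
    return checksums
```

Reading a job's .checksums file and passing its contents to parse_checksums gives a path-to-digest mapping that can be compared against a later run. Lines that do not look like md5sum output are skipped.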
Last updated on 2025-04-09 as part of:  Remove alert from example job 2 page (c38ed1631)
Copyright © 2025 Science and Technology Facilities Council.