Example Job 2: Calculating MD5 Checksums on many files

This page records some early CEDA usage of the LOTUS cluster for various relatively simple tasks. Others may wish to use these examples as a starting point for developing their own workflows on LOTUS.

Case 1: Calculating MD5 Checksums on many files

This is a simple case because (1) the code only needs to read the archive and (2) the code uses only basic Linux commands, so there are no dependencies to pick up from elsewhere.

Case Description

  • We want to calculate the MD5 checksums of about 220,000 files. Running them all in series would take a day or two.
  • We have a text file that contains 220,000 lines, one file path per line.

Solution under LOTUS

  • Split the 220,000 lines into 22 files of 10,000 lines each.
  • Write a template script to:
    • Read a text file full of file paths
    • Run the md5sum command on each file and log the result.
  • Write a script to create 22 new scripts (based on the template script), each of which takes one of the input files and works through it.
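The chunking logic behind the first step can be sketched in Python. This is only an illustration, not part of the original workflow: the function name `split_into_chunks` and the made-up paths are hypothetical, and the chunk names imitate the `x00`, `x01`, ... names that `split -d` produces.

```python
def split_into_chunks(lines, chunk_size=10000, prefix="x"):
    """Group lines into numbered chunks of at most chunk_size lines each."""
    chunks = {}
    for start in range(0, len(lines), chunk_size):
        # Two-digit numeric suffix: x00, x01, ...
        name = "%s%02d" % (prefix, start // chunk_size)
        chunks[name] = lines[start:start + chunk_size]
    return chunks


# 220,000 made-up paths stand in for the real file list.
paths = ["/hypothetical/path/file_%06d.nc" % n for n in range(220000)]
chunks = split_into_chunks(paths)
print(len(chunks), min(chunks), max(chunks))  # prints: 22 x00 x21
```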

And this is how it looks

Log in to the jasmin-sci1 server:

$ ssh -A <username>@sci1.jasmin.ac.uk

Split the big file

$ split -l 10000 -d file_list.txt # Produces 22 files called "x00"..."x21"

Create the template file: scan_files_template.sh

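#!/bin/bash
#SBATCH -e %J.e

# Chunk file to process; __INSERT_FILE__ is replaced when the job scripts are generated.
infile=/home/users/astephen/sst_cci/to_scan/__INSERT_FILE__

# Checksum each listed file and append the result to this chunk's log.
while read -r f ; do
  /usr/bin/md5sum "$f" >> /home/users/astephen/sst_cci/output/scanned___INSERT_FILE__.log
done < "$infile"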
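For readers who prefer Python, the loop in the template can be approximated with the standard `hashlib` module. This is a sketch under stated assumptions rather than the script that was actually run: the function names `md5_of_file` and `scan_files` are hypothetical, and the output line imitates the "digest, two spaces, path" layout that `md5sum` writes.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large files never sit fully in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def scan_files(list_path, log_path):
    """Checksum every path listed in list_path, appending results to log_path."""
    with open(list_path) as listing, open(log_path, "a") as log:
        for line in listing:
            path = line.rstrip("\n")
            if path:
                # Same "digest  path" layout that md5sum writes.
                log.write("%s  %s\n" % (md5_of_file(path), path))
```

Calling `scan_files("x00", "scanned_x00.log")` would then play the role of one generated job script, with both file names chosen here purely for illustration.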

Run a script to generate all the script files:

for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
        cp scan_files_template.sh bin/scan_files_${i}.sh
        perl -p -i -w -e 's/__INSERT_FILE__/'${i}'/g;' bin/scan_files_${i}.sh
done
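The placeholder substitution performed by the perl one-liner can be illustrated in Python. This is a minimal sketch: `render_script` and the one-line sample template are hypothetical, and only the `__INSERT_FILE__` placeholder comes from the template above.

```python
PLACEHOLDER = "__INSERT_FILE__"


def render_script(template_text, chunk_name):
    """Return the template with every placeholder replaced by chunk_name."""
    return template_text.replace(PLACEHOLDER, chunk_name)


# A one-line stand-in for the real template file.
template = "done < /some/dir/__INSERT_FILE__  # log: scanned___INSERT_FILE__.log\n"
print(render_script(template, "x07"), end="")
```

Note that in `scanned___INSERT_FILE__.log` the first underscore belongs to the log name and the next two to the placeholder, so the rendered name is `scanned_x07.log`.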

Submit all 22 jobs to LOTUS:

for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
    echo $i
    sbatch -p short-serial -o /home/users/astephen/sst_cci/output/$i /home/users/astephen/sst_cci/bin/scan_files_${i}.sh
done

Watch the jobs running:

$ squeue -u <username>

And the result

All jobs ran within about an hour.

Case 2: Checksumming CMIP5 Data

A variation on Case 1 has been used for checksumming datasets in the CMIP5 archive. The script below finds all NetCDF files in a DRS dataset and generates a checksums file and an error log. Each dataset is submitted as a separate bsub job.

#!/usr/bin/env python
"""
Checksum a CMIP5 dataset
usage: checksum_dataset.py dataset_id ...
    where dataset_id is a full drs id including version 
    e.g. cmip5.output1.MOHC.HadGEM2-ES.historical.6hr.atmos.6hrLev.r1i1p1.v20110921
"""

import os
import os.path as op
import sys
import optparse

DRS_ROOT = '/badc/cmip5/data'

def submit_job(dataset):
    # Assume version is in the dataset-id for now
    parts = dataset.split('.')
    path = op.join(DRS_ROOT, '/'.join(parts))

    if not op.exists(path):
        raise Exception('%s does not exist' % path)

    job_name = dataset
    cmd = ('bsub -q lotus -J {job_name} '
           '-o {job_name}.checksums -e {job_name}.err '
           "/usr/bin/md5sum '{path}/*/*.nc'").format(job_name=job_name,
                                                     path=path)

    # Show the command, then submit it
    print(cmd)
    os.system(cmd)

def main():
    parser = optparse.OptionParser(description='Checksum DRS datasets')
    (options, args) = parser.parse_args()

    # Every positional argument is a dataset id; submit one job per dataset
    datasets = args
    for dataset in datasets:
        submit_job(dataset)

if __name__ == '__main__':
    main()

If you have a file containing a list of dataset ids you can submit each as a separate job by invoking the above script as follows:

$ ./checksum_dataset.py $(cat datasets_to_checksum.dat) 

bsub -q lotus -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err /usr/bin/md5sum '/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc'

Job <745307> is submitted to queue <lotus>.  ...
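The script above submits through `bsub`, whereas Case 1 submits with `sbatch`. A Slurm-flavoured submission might be built along the following lines. This is a sketch under assumptions, not a tested replacement: `build_sbatch_command` is a hypothetical helper, the `short-serial` partition is borrowed from Case 1, and `--wrap` is used because `sbatch` expects a batch script rather than a bare executable.

```python
import os.path as op


def build_sbatch_command(dataset, drs_root="/badc/cmip5/data",
                         partition="short-serial"):
    """Return an sbatch argument list that checksums one DRS dataset."""
    # The dataset id maps onto the directory tree, one level per component.
    path = op.join(drs_root, *dataset.split("."))
    return [
        "sbatch",
        "-p", partition,
        "-J", dataset,
        "-o", dataset + ".checksums",
        "-e", dataset + ".err",
        # --wrap hands the string to a shell inside the job,
        # so the glob is expanded when the job runs.
        "--wrap=/usr/bin/md5sum %s/*/*.nc" % path,
    ]


cmd = build_sbatch_command(
    "cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128")
print(" ".join(cmd))
```

On a host with Slurm available, the list could be passed to `subprocess.call(cmd)`; passing a list rather than a single string avoids a second round of shell quoting on the submitting side.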
