Sample workflows for LOTUS

This page records some early CEDA usage of the LOTUS cluster for various relatively simple tasks. Others may wish to use these examples as a starting point for developing their own workflows on LOTUS.

Example Job 2: Calculating MD5 Checksums on many files
This is a simple case because the files can be processed independently: each job just needs to run the md5sum command on every file in its list and log the result.

Log in to a sci server (use any of sci-vm-0[1-6], accessed from a login server):
ssh -A <username>@sci-vm-01.jasmin.ac.uk
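Before scripting the whole run, it can help to check what a single md5sum call prints, since this is the format the logs will contain. A minimal sketch (the file path here is a hypothetical placeholder):

# md5sum prints "<checksum>  <path>" for each file given
md5sum /home/users/astephen/sst_cci/example_file.nc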
Split the big file:
split -l 10000 -d file_list.txt

This produces 22 files named "x00" to "x21", each containing up to 10,000 lines of the original list.
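A quick sanity check, assuming the chunk files were written to the current directory and nothing else matches x*, confirms that no lines were lost in the split:

cat x* | wc -l        # total lines across all chunks
wc -l file_list.txt   # should match the total above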
Create the template file: scan_files_template.sh
#!/bin/bash
#SBATCH -e %J.e

# File listing the paths to checksum (one per line)
infile=/home/users/astephen/sst_cci/to_scan/__INSERT_FILE__

# Checksum each file, appending the results to a per-chunk log
# ($f is quoted in case any paths contain spaces)
while read f ; do
    /usr/bin/md5sum "$f" >> /home/users/astephen/sst_cci/output/scanned___INSERT_FILE__.log
done < "$infile"
Generate a job script for each chunk file:

for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
    cp scan_files_template.sh bin/scan_files_${i}.sh
    perl -p -i -w -e 's/__INSERT_FILE__/'${i}'/g;' bin/scan_files_${i}.sh
done
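Before submitting everything, it is worth inspecting one generated script to confirm the placeholder was substituted (x00 is assumed to be the first chunk name):

cat bin/scan_files_x00.sh                                      # inspect one generated script
grep __INSERT_FILE__ bin/scan_files_x00.sh || echo "substitution OK"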
Submit all 22 jobs to LOTUS:
for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
    echo $i
    cat /home/users/astephen/sst_cci/bin/scan_files_${i}.sh | sbatch -p short-serial -o /home/users/astephen/sst_cci/output/$i
done
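While the jobs run, the per-chunk logs should start appearing in the output directory, which gives a quick indication of progress:

ls -l /home/users/astephen/sst_cci/output/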
Monitor the jobs by running:
squeue -u <username>
All jobs ran within about an hour.
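Note that squeue only lists pending and running jobs; once a job finishes it drops out of its output. A sketch for checking completed jobs with Slurm's accounting tool (the exact fields available may vary with the site's Slurm configuration):

sacct -u <username> --format=JobID,JobName,State,Elapsed,ExitCode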
A variation on Example Job 2 has been used for checksumming datasets in the CMIP5 archive. The Python code below will find all NetCDF files in a DRS dataset and generate a checksums file and error log. Each dataset is submitted as a separate Slurm job.
"""
Checksum a CMIP5 dataset
usage: checksum_dataset.py dataset_id ...
where dataset_id is a full drs id including version
e.g. cmip5.output1.MOHC.HadGEM2-ES.historical.6hr.atmos.6hrLev.r1i1p1.v20110921
"""
import os
import os.path as op
import sys
import optparse
DRS_ROOT = '/badc/cmip5/data'
def submit_job(dataset):
# Assume version is in the dataset-id for now
parts = dataset.split('.')
path = op.join(DRS_ROOT, '/'.join(parts))
if not op.exists(path):
raise Exception('%s does not exist' % path)
job_name = dataset
cmd = ("echo -e '#!/bin/bash\n"
"srun /usr/bin/md5sum {path}/*/*.nc' "
"| sbatch -p short-serial -J {job_name} "
"-o {job_name}.checksums -e {job_name}.err"
).format(job_name=job_name, path=path)
print(cmd)
os.system(cmd)
def main():
parser = optparse.OptionParser(description='Checksum DRS datasets')
(options, args) = parser.parse_args()
datasets = args
for dataset in datasets:
submit_job(dataset)
if __name__ == '__main__':
main()
If you have a file containing a list of dataset ids, you can submit each as a separate job by invoking the above script as follows:
./checksum_dataset.py $(cat datasets_to_checksum.dat)
The script prints each command as it is submitted, so the output looks like:

echo -e '#!/bin/bash
srun /usr/bin/md5sum /badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc' | sbatch -p short-serial -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err
Submitted batch job 40898728
...
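Because md5sum writes its standard "<checksum>  <path>" format, each resulting .checksums file can later be re-verified against the archive with md5sum -c, e.g. for the dataset above:

md5sum -c cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums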