14  Testing Scripts

14.1 The cardinal rule of testing scripts

You should identify a file to take through your workflow or SLURM job that is representative of your other files. One approach is to pick the largest file you will process - that gives you an upper bound on memory usage and CPU utilization.

Then you are going to test your workflow on that representative file.
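For example, if your workflow processes FASTQ files, you might find the largest one and submit a single test job for it before scaling up. This is a minimal sketch - the paths, script name, and resource requests are hypothetical and should be adapted to your own workflow:

# Find the largest FASTQ file in the input directory (hypothetical path)
ls -S /fh/fast/mylab/fastq/*.fastq.gz | head -n 1

# Submit a single test job for that file before scaling up
sbatch --cpus-per-task=4 --mem=8G run_workflow.sh largest_sample.fastq.gz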

14.1.1 Profiling Jobs using SLURM

There are two ways of profiling your job:

  • To profile a running job, use sstat -j <JOBID> -o jobid,avecpu,averss (sstat reports only on running job steps, and supports a smaller set of fields than sacct)
  • To profile a job that has finished, use sacct -j <JOBID> -o jobid,alloccpus,allocnodes,avecpu,averss

For example, I can look at the usage of the miniwdl run I just ran using:

sacct -j 41381212 -o jobid,jobname,alloccpus,allocnodes,avecpu,averss
JobID           JobName  AllocCPUS AllocNodes     AveCPU     AveRSS 
------------ ---------- ---------- ---------- ---------- ---------- 
41381212_1   run_sbatc+          1          1                       
41381212_1.+      batch          1          1   00:00:00      7724K 
41381212_1.+     extern          1          1   00:00:00      1168K 
41381212_2   run_sbatc+          1          1                       
41381212_2.+      batch          1          1   00:00:00      3236K 
41381212_2.+     extern          1          1   00:00:00      1156K 
41381212_3   run_sbatc+          1          1                       
41381212_3.+      batch          1          1   00:00:00      7596K 
41381212_3.+     extern          1          1   00:00:00      1160K
  • JobID is the ID of each sub-job, with one row per step
  • JobName is the name of the step run in each sub-job
  • AllocCPUS is the number of CPUs allocated to the task
  • AllocNodes is the number of nodes allocated to the task
  • AveCPU is the average (system + user) CPU time of all tasks in the step
  • AveRSS is the average resident set size (memory) of the tasks in that step

So, by running a single representative task and checking these numbers, you can see whether your job is actually using the resources you requested on each node.
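Note that AveRSS can understate a step's peak memory use. When sizing your --mem requests, it's worth also asking sacct for MaxRSS (peak resident memory) and Elapsed (wall-clock time), which are both standard sacct fields:

# Peak memory and wall-clock time per step for a finished job
sacct -j 41381212 -o jobid,maxrss,elapsed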

More information on profiling: https://csc-training.github.io/csc-env-eff/hands-on/batch_resources/tutorial_sacct_and_seff.html
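That tutorial also covers seff, a job-efficiency summary tool that ships in SLURM's contribs and that many sites install. If it's available on your cluster, it condenses the CPU and memory efficiency of a finished job into one short report:

# Efficiency summary for a finished job (if seff is installed at your site)
seff 41381212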

14.2 Developing and testing scripts interactively

One of the hard things to understand is what can be run on a compute node versus the head node, and what file systems are accessible from a compute node.

Many of the issues you will run into come down to the mental model of how cluster computing works, and the best way to build that mental model is to test your code on a compute node.

Let’s explore how we can do that. You should also review the material about using screen (Section 16.7).

14.2.1 Testing code on a compute node

Fred Hutch users have the advantage of grabnode, a custom command that requests an interactive session on a compute node. (On other SLURM systems, you can usually get one with srun --pty bash or salloc; see Section 14.2.2.)

Why would you want to do this? A good part of this is about testing software and making sure that your paths are correct.
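For example, once you have a shell on a compute node, a few quick checks cover most software and path problems. This is a sketch; the module name, tool, and data path are hypothetical examples:

# Which node am I actually on?
hostname

# Is my software on the PATH? (module and tool names are examples)
module load SAMtools
which samtools

# Can this node see my data?
ls /fh/fast/mylab/data/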

Note: Don’t rely on grabnode/interactive mode for your batch work

We often see users who request a multicore node with higher memory and do all of their processing on that node.

This doesn’t take advantage of the many machines available on the cluster, and so it is a suboptimal way to use it.

When you are doing interactive analysis, such as working in JupyterLab or RStudio, that is a valid way to work. But when you have tasks you can scatter amongst many nodes, requesting a single high-spec node is not the optimal approach.

The other disadvantage is that you may wait a very long time for that multicore node to become available, whereas if you batch your work across many smaller allocations, it will finish much faster.
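For example, rather than looping over 100 files on one large interactive node, you can submit them as a job array and let the scheduler spread the tasks across whatever nodes are free. A minimal sketch, assuming a hypothetical process_one.sh script and a samples.txt manifest with one input path per line:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Pick out this array task's input file from the manifest
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
./process_one.sh "$SAMPLE"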

14.2.2 Grabbing an interactive shell on a worker

When you’re testing code that’s going to run on a worker node, you need to be aware of what the worker node sees.

An interactive shell also helps with estimating how long our tasks are going to run, since we can measure how long a task takes on a representative dataset.
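One lightweight way to do that measurement is to wrap a representative run in /usr/bin/time -v (GNU time), which reports both wall-clock time and peak memory ("Maximum resident set size") in one shot. The script and input names here are hypothetical:

# Elapsed time feeds your --time estimate; peak RSS feeds your --mem estimate
/usr/bin/time -v ./process_one.sh representative_sample.fastq.gz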

Note for FH users: grabnode

On the FH system, we can use a command called grabnode, which lets us request a node interactively. It will prompt for our requirements (number of cores, memory, and so on).

tladera2@rhino01:~$ grabnode

grabnode will then ask us what kind of instance we want, in terms of CPUs, memory, and GPUs. Here, I’m grabbing a node with 8 cores and 8 GB of memory for 1 day, with no GPU.

How many CPUs/cores would you like to grab on the node? [1-36] 8
How much memory (GB) would you like to grab? [160] 8
Please enter the max number of days you would like to grab this node: [1-7] 1
Do you need a GPU ? [y/N]n

You have requested 8 CPUs on this node/server for 1 days or until you type exit.

Warning: If you exit this shell before your jobs are finished, your jobs
on this node/server will be terminated. Please use sbatch for larger jobs.

Shared PI folders can be found in: /fh/fast, /fh/scratch and /fh/secure.

Requesting Queue: campus-new cores: 8 memory: 8 gpu: NONE
srun: job 40898906 queued and waiting for resources

After a little bit, you’ll arrive at a new prompt:

(base) tladera2@gizmok164:~$

Now you can test your batch scripts and make sure your file paths are correct. An interactive node is also helpful for profiling your job.

If you’re doing interactive analysis that will span a few days, I recommend using screen or tmux.
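For example, with screen you can start a named session, detach, and reattach later, so a dropped connection doesn’t kill your work. The session name here is arbitrary:

# Start a named session
screen -S analysis

# Detach with Ctrl-a d; later, reattach with:
screen -r analysis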

Note: For other HPC systems

On a SLURM system, the way to open an interactive shell on a node has changed across versions. Check your SLURM version first:

srun --version

If you’re on a version before 20.11, you can use srun --pty bash to open an interactive terminal on a worker:

srun --pty bash

If the version is 20.11 or later, you can open an interactive shell on a worker with salloc, which starts a shell for you once the allocation is granted:

salloc

Note: Remember hostname

When you are doing interactive analysis, it is easy to forget which node you’re working on. As a quick check, I use hostname (Section 16.1) to remind myself whether I’m on rhino, gizmo, or inside an Apptainer container.
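For example, on the node grabbed above, hostname confirms I’m on a gizmo compute node rather than the rhino head node:

hostname
# gizmok164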