9 Workflows
9.1 Learning Objectives
- Execute a workflow on the Fred Hutch Cluster
- Modify a workflow and test it on a compute node
- Utilize a container and test inside it on a compute node
9.2 Using a Container
In Section 8.2, we learned a little bit about using Apptainer to run a Docker container. Let's try pulling a common container, SAMtools, and running commands inside it.
The first thing we need to do is load Apptainer:

```bash
module load Apptainer/1.1.6
```

Then we can pull the Docker container:

```bash
apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1
```

We can check that we have pulled the Docker image by using:

```bash
apptainer cache list
```

Okay, we have confirmed that we downloaded the Apptainer image, so now we can try to execute things with it.
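Below is a minimal sketch of what that `apptainer exec` call can look like. The `/fh/fast` bind path and the `samtools --version` invocation are illustrative assumptions, so substitute your own directory and command:

```bash
# Sketch: execute samtools inside the container (bind path is an assumption).
apptainer exec --bind /fh/fast docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools --version
```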
1. `--bind`: the bind path
2. `docker://biocontainers/samtools:v1.9-4-deb_cv1`: the Docker image we have downloaded
3. `samtools`: the command to run
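If you would rather test things interactively inside the container on a compute node, `apptainer shell` opens a shell in it:

```bash
# Open an interactive shell inside the container; type exit to leave.
apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1
```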
It's worth trying this once to make sure you understand how all of the pieces are connected. In general, though, I recommend using a workflow runner (Section 9.3) instead, because it helps manage all of these details.
9.3 Running Workflows
So far, we've been batch processing using job arrays in SLURM. However, when you graduate to pipelines with multiple steps, you should consider using a workflow manager.

A workflow manager takes a workflow, which can consist of multiple steps, and allows you to batch process files through it.
A good workflow manager will allow you to:
- Restart failed subjobs in the workflow
- Customize where intermediate and final outputs go
- Swap and customize modules in your workflow
- Adapt to different computing architectures (HPC, cloud, etc.)
Many bioinformaticists have used workflow managers to process and manage hundreds or thousands of files at a time. They are well worth the extra effort it takes to learn.
Here is an overview of some of the common bioinformatics workflow managers. We will be using miniWDL, which runs WDL files.
| Manager Software | Workflow Formats | Notes |
|---|---|---|
| Cromwell | WDL/CWL | Made for HPC jobs |
| Sprocket | WDL | Made for HPC jobs |
| miniWDL | WDL | Used for local testing of workflows |
| DNAnexus | WDL/CWL | Used for systems such as All of Us |
| Nextflow | `.nf` files | Owned by Seqera |
| Snakemake | Snakefiles (Make-like) | |
9.3.1 Grabbing a WDL workflow from GETWILDS
```bash
git clone https://github.com/getwilds/ww-star-deseq2/
```

9.3.2 Executing a WDL workflow
Say someone has given us a WDL file - how do we set it up to run on our own data?
In order to test this out on a batch of files, we'll first request a node using grabnode (Section 10.1.1).

```bash
grabnode
```

On our node, we'll use miniWDL to run our WDL workflow. It is accessible via the cirro module:

```bash
module load cirro
which miniwdl
```

We will get the response:

```
/app/software/cirro/1.7.0-foss-2024a/bin/miniwdl
```
We'll investigate using `miniwdl run` as an initial way to interact with miniwdl:

```bash
miniwdl run
```
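On its own, `miniwdl run` prints a usage message. A fuller invocation might look like this sketch, assuming the WDL file and a matching inputs JSON (see Section 9.3.3) are in your working directory:

```bash
# Sketch: run the workflow with an inputs JSON (filenames are assumptions).
miniwdl run ww-star2-deseq2.wdl \
  --input ww-star2-deseq2-inputs.json \
  --verbose
```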
For comparison, the equivalent Cromwell command (assuming a Cromwell module that sets `$EBROOTCROMWELL`) is:

```bash
java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
    --inputs input_file.json
```

9.3.3 Input Files as JSON
If you want to work with the JSON format (Section 8.1), there is a trick to generating a template `.json` file for a workflow. Cromwell's companion utility, WOMtool, can generate a template file from a workflow file, and miniwdl has equivalent functionality built in:
```bash
miniwdl input-template ww-star2-deseq2.wdl > ww-star2-deseq2-inputs.json
```

This will generate a file called `ww-star2-deseq2-inputs.json` that contains all of the inputs:

```json
{
  "star_deseq2.samples": "Array[WomCompositeType {\n name -> String\nr1 -> File\nr2 -> File\ncondition -> String \n}]",
  "star_deseq2.rnaseqc_cov.cpu_cores": "Int (optional, default = 2)",
  ...
}
```
This can be a good head start to making your .json files.
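As a sketch of what a filled-in inputs file might look like, based on the `samples` structure in the template above (the sample name, paths, and condition are made-up placeholders):

```bash
# Write a hypothetical filled-in inputs file; every value here is a placeholder.
cat > ww-star2-deseq2-inputs.json <<'EOF'
{
  "star_deseq2.samples": [
    {
      "name": "sample1",
      "r1": "/path/to/sample1_R1.fastq.gz",
      "r2": "/path/to/sample1_R2.fastq.gz",
      "condition": "treated"
    }
  ]
}
EOF
```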
9.3.4 Working with file manifests in WDL
Another strategy you can use to cycle through a list of files is to process them from a file manifest. There is an example workflow below that shows how to work with one; this approach can be helpful for those who aren't necessarily comfortable working with JSON.
This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.
This is what the file manifest contains. Notice there are three named columns in this file: `sampleName`, `bamLocation`, and `bedLocation`.
| sampleName | bamLocation | bedLocation |
|---|---|---|
| smallTestData | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
| smallTestData-reference | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
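If you are assembling a manifest like this yourself, note that it is a tab-separated file with a header row. A minimal sketch (all paths below are made-up placeholders):

```bash
# Sketch: build a tab-separated manifest with the three named columns.
{
  printf 'sampleName\tbamLocation\tbedLocation\n'
  printf 'smallTestData\t/path/to/smallTestData.unmapped.bam\t/path/to/panel.bed\n'
} > manifest.tsv
```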
Here’s the workflow part of the file:
```wdl
version 1.0

#### WORKFLOW DEFINITION

workflow ParseBatchFile {
  input {
    File batch_file
  }

  Array[Object] batch_info = read_objects(batch_file)  # (1)

  scatter (job in batch_info) {  # (2)
    String sample_name = job.sampleName  # (3)
    File bam_file = job.bamLocation  # (4)
    File bed_file = job.bedLocation  # (5)

    ## INSERT YOUR WORKFLOW TO RUN PER LINE IN YOUR BATCH FILE HERE!!!!
    call Test {  # (6)
      input: in1=sample_name, in2=bam_file, in3=bed_file
    }
  } # End scatter over the batch file

  # Outputs that will be retained when execution is complete
  output {
    Array[File] output_array = Test.item_out
  }

  parameter_meta {
    batch_file: "input tsv containing details about each sample in the batch"
    output_array: "array containing details about each sample in the batch"
  }
} # End workflow
```

1. Read in the file manifest line by line, and store it in the array called `batch_info`.
2. Cycle through the manifest line by line, scattering the work to multiple nodes.
3. Get the `sample_name` input from `job.sampleName`.
4. Get the `bam_file` input from `job.bamLocation`.
5. Get the `bed_file` input from `job.bedLocation`.
6. Do something with the inputs.
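To run a manifest-driven workflow like this with miniwdl, you can pass the `batch_file` input on the command line. Here is a sketch (the WDL filename and manifest path are placeholders):

```bash
# Sketch: supply the manifest as the workflow's single input.
# "parse_batch_file.wdl" and the manifest path are assumptions for illustration.
miniwdl run parse_batch_file.wdl batch_file=/path/to/manifest.tsv
```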
9.3.5 Let’s start with WDL Tasks
We will start with the lowest level of abstraction in WDL: the task.
9.3.6 Anatomy of a Task
```wdl
task build_star_index {
  meta {
    ...
  }

  parameter_meta {
    reference_fasta: "Reference genome FASTA file"
    reference_gtf: "Reference genome GTF annotation file"
    sjdb_overhang: "Length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database"
    genome_sa_index_nbases: "Length (bases) of the SA pre-indexing string, typically between 10-15 (scales with genome size)"
    memory_gb: "Memory allocated for the task in GB"
    cpu_cores: "Number of CPU cores allocated for the task"
  }

  input {  # (1)
    File reference_fasta
    File reference_gtf
    Int sjdb_overhang = 100
    Int genome_sa_index_nbases = 14
    Int memory_gb = 64
    Int cpu_cores = 8
  }

  command <<<  # (2)
    set -eo pipefail
    mkdir star_index
    echo "Building STAR index..."
    STAR \
      --runMode genomeGenerate \
      --runThreadN ~{cpu_cores} \
      --genomeDir star_index \
      --genomeFastaFiles "~{reference_fasta}" \
      --sjdbGTFfile "~{reference_gtf}" \
      --sjdbOverhang ~{sjdb_overhang} \
      --genomeSAindexNbases ~{genome_sa_index_nbases}
    tar -czf star_index.tar.gz star_index/
  >>>

  output {  # (3)
    File star_index_tar = "star_index.tar.gz"
  }

  runtime {  # (4)
    docker: "getwilds/star:2.7.6a"
    memory: "~{memory_gb} GB"
    cpu: cpu_cores
  }
}
```

1. Inputs for our task.
2. Bash commands to execute in the task.
3. Description of the output.
4. Runtime requirements for execution. These are a lot like the `#SBATCH` directives.
Everything between the `<<<` and `>>>` is essentially a Bash script. WDL has its own variables, which are interpolated into the command with the `~{}` syntax (for example, `~{cpu_cores}` above).
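Before running a task or workflow, it can be worth validating the WDL file itself. miniwdl has a built-in checker (the filename here is carried over from earlier and assumed to be in your working directory):

```bash
# Validate the WDL file and surface syntax errors or warnings.
miniwdl check ww-star2-deseq2.wdl
```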
9.3.7 Architecture of a WDL file
The best way to read WDL files is to read them top down. We’ll focus on the basic sections of a WDL file before we see how they work together.
The code below is from the WILDs WDL Repo.
```wdl
workflow SRA_STAR2Pass {
  meta {
    ...
  }

  parameter_meta {
    ...
  }

  input {  # (1)
    ...
  }

  if (!defined(reference_genome)) {  # (2)
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index {  # (3)
    ...
  }

  # Outputs that will be retained when execution is complete
  output {  # (4)
    ...
  }
}
```

1. Inputs for the workflow.
2. If/then statement of the workflow.
3. Call a task.
4. Outputs of the workflow.
Let's go through each of these in detail (we'll get back to the meta and parameter_meta sections).
Here is the overall structure of the workflow:
```wdl
workflow SRA_STAR2Pass {
  input {
    Array[SampleInfo] samples
    RefGenome? reference_genome
    String reference_level = ""
    String contrast = ""
  }

  if (!defined(reference_genome)) {
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index { input:
    reference_fasta = ref_genome_final.fasta,
    reference_gtf = ref_genome_final.gtf
  }

  # Outputs that will be retained when execution is complete
  output {
    ...
  }
}
```

9.4 Where Next?
Now that you understand the basics of working with Bash and WDL, you are ready to start working with WDL workflows.