6 Workflows
6.1 Learning Objectives
- Execute a workflow on the Fred Hutch Cluster
- Modify a workflow and test it on a compute node
- Utilize a container and test inside it on a compute node
6.2 Running Workflows
So far, we’ve been batch processing using job arrays in SLURM. However, when you graduate to processing files in multiple steps, you should consider using a workflow manager.
A workflow manager takes a workflow, which can consist of multiple steps, and lets you batch process files through it.
A good workflow manager will allow you to:
- Restart failed subjobs in the workflow
- Customize where intermediate and final outputs go
- Swap and customize modules in your workflow
- Adapt to different computing architectures (HPC/cloud/etc)
Many bioinformaticians have used workflow managers to process and manage hundreds or thousands of files at a time. They are well worth learning.
Here is an overview of some of the common bioinformatics workflow managers. We will be using cromwell, which runs WDL files.
Manager Software | Workflow Formats | Notes |
---|---|---|
Cromwell | WDL/CWL | Made for HPC jobs |
Sprocket | WDL | Made for HPC jobs |
MiniWDL | WDL | Used for local testing of workflows |
DNANexus | WDL/CWL | Used for systems such as AllOfUs |
Nextflow | .nf files | Owned by Seqera |
Snakemake | Snakefiles | |
6.2.1 Grabbing a WDL workflow from GETWILDS
git clone https://github.com/getwilds/ww-star-deseq2/
6.2.2 Executing a WDL workflow
Say someone has given us a WDL file - how do we set it up to run on our own data?
We’ll use cromwell to run our WDL workflow.
module load cromwell/87
We will get the response:
To execute cromwell, run: java -jar $EBROOTCROMWELL/cromwell.jar
To execute womtool, run: java -jar $EBROOTCROMWELL/womtool.jar
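Typing out the full java command each time gets tedious. As a small convenience sketch (assuming the module sets $EBROOTCROMWELL as shown in the message above), you can define shell aliases for the two jars:

# Optional shorthand for the two jars provided by the cromwell module.
# Assumes $EBROOTCROMWELL is set by `module load cromwell/87` as shown above.
alias cromwell='java -jar $EBROOTCROMWELL/cromwell.jar'
alias womtool='java -jar $EBROOTCROMWELL/womtool.jar'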
We’ll investigate using cromwell run as an initial way to interact with cromwell.
java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
--inputs input_file.json
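Cromwell keeps running for as long as the workflow does, so rather than launching it on a login node you will usually want to run it from an interactive session on a compute node or submit it as its own SLURM job. Here is a minimal sketch of a submission script; the job name, resources, and time limit are placeholders to adjust for your own workflow:

#!/bin/bash
#SBATCH --job-name=cromwell-star-deseq2
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00

# Load cromwell and launch the workflow. Depending on how its backend is
# configured, Cromwell may run tasks locally or submit them as separate jobs.
module load cromwell/87
java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
    --inputs input_file.json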
6.2.3 Input Files as JSON
If you want to work with the JSON format (Section 5.2), there is a trick to generating a template .json file for a workflow: an additional utility called womtool will generate a template file from a workflow file.
java -jar $EBROOTCROMWELL/womtool.jar inputs ww-star2-deseq2.wdl > ww-star2-deseq2-inputs.json
This will generate a file called ww-star2-deseq2-inputs.json that contains all of the inputs:
{
"star_deseq2.samples": "Array[WomCompositeType {\n name -> String\nr1 -> File\nr2 -> File\ncondition -> String \n}]",
"star_deseq2.rnaseqc_cov.cpu_cores": "Int (optional, default = 2)",
...
}
This can be a good head start to making your .json files.
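The values in the template are type descriptions, not real values; you replace them with your own. A hypothetical filled-in version might look like the following (the sample names and file paths are placeholders, and the exact struct fields depend on the workflow's definitions):

{
  "star_deseq2.samples": [
    {
      "name": "sample_A",
      "r1": "/path/to/sample_A_R1.fastq.gz",
      "r2": "/path/to/sample_A_R2.fastq.gz",
      "condition": "treated"
    },
    {
      "name": "sample_B",
      "r1": "/path/to/sample_B_R1.fastq.gz",
      "r2": "/path/to/sample_B_R2.fastq.gz",
      "condition": "control"
    }
  ],
  "star_deseq2.rnaseqc_cov.cpu_cores": 4
}

Inputs marked as optional in the template can be deleted from the file entirely to accept their defaults.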
6.2.4 Working with file manifests in WDL
Last week, we worked with a file manifest to process a list of files.
There is an example workflow that shows how to work with a file manifest. This can be helpful for those who aren’t yet comfortable working with JSON.
This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.
This is what the file manifest contains. Notice there are three named columns for this file: sampleName, bamLocation, and bedLocation.
sampleName | bamLocation | bedLocation |
---|---|---|
smallTestData | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
smallTestData-reference | /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam | /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed |
Here’s the workflow part of the file:
version 1.0
#### WORKFLOW DEFINITION
workflow ParseBatchFile {
  input {
    File batch_file
  }

  Array[Object] batch_info = read_objects(batch_file)  # 1

  scatter (job in batch_info) {  # 2
    String sample_name = job.sampleName  # 3
    File bam_file = job.bamLocation  # 4
    File bed_file = job.bedLocation  # 5

    ## INSERT YOUR WORKFLOW TO RUN PER LINE IN YOUR BATCH FILE HERE!!!!
    call Test {  # 6
      input: in1=sample_name, in2=bam_file, in3=bed_file
    }
  } # End Scatter over the batch file

  # Outputs that will be retained when execution is complete
  output {
    Array[File] output_array = Test.item_out
  }

  parameter_meta {
    batch_file: "input tsv containing details about each sample in the batch"
    output_array: "array containing details about each sample in the batch"
  }
} # End workflow
1. Read in the file manifest line by line, and store it in the array called batch_info.
2. Cycle through the manifest line by line, scattering the work across multiple nodes.
3. Get the sample_name input from job.sampleName.
4. Get the bam_file input from job.bamLocation.
5. Get the bed_file input from job.bedLocation.
6. Do something with the inputs (see the sketch of a placeholder Test task below).
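The workflow above only runs if a task named Test actually exists in the same file. Here is a minimal placeholder sketch of what such a task could look like; it just writes the three inputs into a small report file, so you would replace the command block with your real per-sample step (the echo commands and the ubuntu image are stand-ins, not part of the original example):

task Test {
  input {
    String in1
    File in2
    File in3
  }

  command <<<
    # Placeholder: record the inputs so the workflow has something to collect
    echo "sample: ~{in1}" > "~{in1}.report.txt"
    echo "bam: ~{in2}" >> "~{in1}.report.txt"
    echo "bed: ~{in3}" >> "~{in1}.report.txt"
  >>>

  output {
    File item_out = "~{in1}.report.txt"
  }

  runtime {
    docker: "ubuntu:22.04"
    memory: "2 GB"
    cpu: 1
  }
}

Because the workflow has a single input, its inputs JSON is just one entry, for example {"ParseBatchFile.batch_file": "/path/to/manifest.tsv"}, with the path pointing at your own manifest.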
6.2.5 Let’s start with WDL Tasks
We will start with the lowest level of abstraction in WDL: the task.
6.2.6 Anatomy of a Task
task build_star_index {
  meta {
    ...
  }

  parameter_meta {
    reference_fasta: "Reference genome FASTA file"
    reference_gtf: "Reference genome GTF annotation file"
    sjdb_overhang: "Length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database"
    genome_sa_index_nbases: "Length (bases) of the SA pre-indexing string, typically between 10-15 (scales with genome size)"
    memory_gb: "Memory allocated for the task in GB"
    cpu_cores: "Number of CPU cores allocated for the task"
  }

  input {  # 1
    File reference_fasta
    File reference_gtf
    Int sjdb_overhang = 100
    Int genome_sa_index_nbases = 14
    Int memory_gb = 64
    Int cpu_cores = 8
  }

  command <<<  # 2
    set -eo pipefail
    mkdir star_index
    echo "Building STAR index..."
    STAR \
      --runMode genomeGenerate \
      --runThreadN ~{cpu_cores} \
      --genomeDir star_index \
      --genomeFastaFiles "~{reference_fasta}" \
      --sjdbGTFfile "~{reference_gtf}" \
      --sjdbOverhang ~{sjdb_overhang} \
      --genomeSAindexNbases ~{genome_sa_index_nbases}
    tar -czf star_index.tar.gz star_index/
  >>>

  output {  # 3
    File star_index_tar = "star_index.tar.gz"
  }

  runtime {  # 4
    docker: "getwilds/star:2.7.6a"
    memory: "~{memory_gb} GB"
    cpu: cpu_cores
  }
}
1. Inputs for our task.
2. Bash commands to execute in the task.
3. Description of the output.
4. Runtime requirements for execution. This is a lot like the #SBATCH directives.
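The docker: line in the runtime block is also handy for testing by hand. On a compute node you can open a shell inside the same image with Apptainer and try the tool before editing the command block; the module name below is an assumption, so check what is available on your cluster (the image tag mirrors the runtime block above):

# Grab an interactive compute node first, then open a shell in the task's container
module load Apptainer
apptainer shell docker://getwilds/star:2.7.6a

# Inside the container, confirm the tool is present
STAR --version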
Everything between the <<< and >>> is essentially a Bash script. WDL has its own variables, which are referenced inside the command block with the ~{ } syntax (for example, ~{cpu_cores} above).
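Once you start modifying a task’s command block, it is worth checking that the file is still valid WDL before submitting anything. womtool (loaded alongside cromwell above) has a validate command for exactly this:

# Check the WDL syntax before running; womtool reports parse errors with line numbers
java -jar $EBROOTCROMWELL/womtool.jar validate ww-star2-deseq2.wdl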
6.2.7 Architecture of a WDL file
The best way to read WDL files is to read them top down. We’ll focus on the basic sections of a WDL file before we see how they work together.
The code below is from the WILDs WDL Repo.
workflow SRA_STAR2Pass {
  meta {
    ...
  }

  parameter_meta {
    ...
  }

  input {  # 1
    ...
  }

  if (!defined(reference_genome)) {  # 2
    call download_reference {}
  }
  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index {  # 3
    ...
  }

  # Outputs that will be retained when execution is complete
  output {  # 4
    ...
  }
}
1. Inputs for the workflow.
2. If/then statement of the workflow.
3. Call a task.
4. Outputs of the workflow.
Let’s go through each of these in detail (we’ll get back to the meta and parameter_meta sections).
Here is the structure of the workflow:
workflow SRA_STAR2Pass {
  input {
    Array[SampleInfo] samples
    RefGenome? reference_genome
    String reference_level = ""
    String contrast = ""
  }

  if (!defined(reference_genome)) {
    call download_reference {}
  }
  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index { input:
    reference_fasta = ref_genome_final.fasta,
    reference_gtf = ref_genome_final.gtf
  }

  # Outputs that will be retained when execution is complete
  output {
    ...
  }
}
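The if (!defined(...)) / select_first() pair is a common WDL idiom for optional inputs with a computed fallback: the task that produces the default is only called when the input was not supplied, and select_first() then picks whichever of the two values is actually defined. Here is a minimal standalone sketch of the same pattern; the workflow, task, and values are made up for illustration:

version 1.0

workflow OptionalInputDemo {
  input {
    String? greeting    # optional: may or may not be supplied in the inputs JSON
  }

  # Only produce a default when no value was supplied
  if (!defined(greeting)) {
    call MakeDefault {}
  }

  # select_first() returns the first defined value in the array
  String greeting_final = select_first([greeting, MakeDefault.out])

  output {
    String final_greeting = greeting_final
  }
}

task MakeDefault {
  command <<<
    echo "hello"
  >>>
  output {
    String out = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}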
6.3 Where Next?
Now that you understand the basics of working with Bash and WDL, you are ready to start working with WDL workflows.