6  Workflows

6.1 Learning Objectives

  • Execute a workflow on the Fred Hutch Cluster
  • Modify a workflow and test it on a compute node
  • Utilize a container and test inside it on a compute node

6.2 Running Workflows

So far, we’ve been batch processing files using SLURM job arrays. However, once your processing involves multiple steps, you should consider using a workflow manager.

A workflow manager takes a workflow, which can consist of multiple steps, and lets you batch process files through those steps.

A good workflow manager will allow you to:

  • Restart failed subjobs in the workflow
  • Customize where intermediate and final outputs go
  • Swap and customize modules in your workflow
  • Adapt to different computing architectures (HPC, cloud, etc.)

Many bioinformaticians use workflow managers to process and manage hundreds or thousands of files at a time. They are well worth learning.

Here is an overview of some common bioinformatics workflow managers. We will be using Cromwell, which runs WDL files.

Manager Software   Workflow Formats        Notes
Cromwell           WDL/CWL                 Made for HPC jobs
Sprocket           WDL                     Made for HPC jobs
MiniWDL            WDL                     Used for local testing of workflows
DNAnexus           WDL/CWL                 Used for systems such as All of Us
Nextflow           .nf files               Owned by Seqera
Snakemake          Snakefiles (Make-like)

6.2.1 Grabbing a WDL workflow from GETWILDS

git clone https://github.com/getwilds/ww-star-deseq2/
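
Once it’s cloned, take a quick look at what’s inside (the repository layout may change over time):

cd ww-star-deseq2
ls

Look for the workflow’s .wdl file; this is what we’ll hand to the workflow manager below.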

6.2.2 Executing a WDL workflow

Say someone has given us a WDL file - how do we set it up to run on our own data?

We’ll use cromwell to run our WDL workflow.

module load cromwell/87

We will get the response:

To execute cromwell, run: java -jar $EBROOTCROMWELL/cromwell.jar

To execute womtool, run: java -jar $EBROOTCROMWELL/womtool.jar

We’ll investigate using cromwell run as an initial way to interact with cromwell.

java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
   --inputs input_file.json
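A cromwell run invocation can take hours on a real dataset, so on the cluster it is best submitted as its own SLURM job rather than run on a login node. Here is a minimal sketch of a submission script; the resource requests are placeholders to adjust for your workflow:

#!/bin/bash
#SBATCH --job-name=cromwell-run
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=12:00:00

module load cromwell/87
java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
   --inputs input_file.json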

The other way to run Cromwell is to run it in server mode. You can start the server by using:

java -jar $EBROOTCROMWELL/cromwell.jar serve
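
By default the server listens on port 8000, and you can submit workflows to it through Cromwell’s REST API. For example, with curl from the same node:

curl -X POST http://localhost:8000/api/workflows/v1 \
   -F workflowSource=@ww-star2-deseq2.wdl \
   -F workflowInputs=@input_file.json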

While you can run the server on the command line, I don’t recommend it. Instead, check out the PROOF application for a GUI-based way to execute workflows; it handles Cromwell server management automatically.

6.2.3 Input Files as JSON

If you want to work with the JSON format (Section 5.2), there is a trick to generating a template .json file for a workflow: an additional utility called womtool will generate a template inputs file from a workflow file.

java -jar $EBROOTCROMWELL/womtool.jar inputs ww-star2-deseq2.wdl > ww-star2-deseq2-inputs.json

This will generate a file called ww-star2-deseq2-inputs.json that contains all of the inputs:

{
  "star_deseq2.samples": "Array[WomCompositeType {\n name -> String\nr1 -> File\nr2 -> File\ncondition -> String \n}]",
  "star_deseq2.rnaseqc_cov.cpu_cores": "Int (optional, default = 2)",
  ...
}

This can be a good head start to making your .json files.
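
As a purely hypothetical example, a filled-in inputs file for this workflow might look like the following (all sample names and paths are placeholders):

{
  "star_deseq2.samples": [
    {
      "name": "sample1",
      "r1": "/path/to/sample1_R1.fastq.gz",
      "r2": "/path/to/sample1_R2.fastq.gz",
      "condition": "treated"
    }
  ],
  "star_deseq2.rnaseqc_cov.cpu_cores": 2
}

Inputs marked optional in the template can be deleted entirely if you are happy with their defaults. You can also sanity-check the workflow file itself with womtool: java -jar $EBROOTCROMWELL/womtool.jar validate ww-star2-deseq2.wdl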

6.2.4 Working with file manifests in WDL

Last week, we worked with a file manifest to process a list of files.

There is an example workflow that shows how to work with a file manifest. This can be helpful for those who aren’t yet comfortable working with JSON.

This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.

This is what the file manifest contains. Notice there are three named columns for this file: sampleName, bamLocation, and bedLocation.

sampleName bamLocation bedLocation
smallTestData /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
smallTestData-reference /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
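
One thing to watch out for: read_objects() (used in the workflow below) expects the manifest to be tab-delimited, even though the columns above are displayed with spaces. Assuming your manifest is saved as manifest.tsv, a quick way to check is:

head -n 1 manifest.tsv | cat -A

Tabs show up as ^I in the output; if you see only spaces, the columns won’t parse correctly.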

Here’s the workflow part of the file:

version 1.0
#### WORKFLOW DEFINITION

workflow ParseBatchFile {
  input {
    File batch_file
  }

1  Array[Object] batch_info = read_objects(batch_file)

2  scatter (job in batch_info){
3    String sample_name = job.sampleName
4    File bam_file = job.bamLocation
5    File bed_file = job.bedLocation

    ## INSERT YOUR WORKFLOW TO RUN PER LINE IN YOUR BATCH FILE HERE!!!!
6    call Test {
      input: in1=sample_name, in2=bam_file, in3=bed_file
    }
  }  # End Scatter over the batch file

  # Outputs that will be retained when execution is complete
  output {
    Array[File] output_array = Test.item_out
  }

  parameter_meta {
    batch_file: "input tsv containing details about each sample in the batch"
    output_array: "array containing details about each sample in the batch"
  }
} # End workflow
1. Read in the file manifest line by line, and store it in the array called batch_info.
2. Cycle through the manifest line by line, scattering the work out to multiple nodes.
3. Get the sample_name input from job.sampleName.
4. Get the bam_file input from job.bamLocation.
5. Get the bed_file input from job.bedLocation.
6. Do something with the inputs.
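
To run this workflow, the only input you need to supply is the manifest location. As a sketch, assuming the workflow above is saved as parse-batch.wdl, put the following in parse-batch-inputs.json (the file names and path are placeholders):

{
  "ParseBatchFile.batch_file": "/path/to/manifest.tsv"
}

Then run it the same way as before:

java -jar $EBROOTCROMWELL/cromwell.jar run parse-batch.wdl \
   --inputs parse-batch-inputs.json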

6.2.5 Let’s start with WDL Tasks

We will start at the lowest level of abstraction in WDL: the task.

6.2.6 Anatomy of a Task

task build_star_index {
  meta {
      ...
  }

  parameter_meta {
    reference_fasta: "Reference genome FASTA file"
    reference_gtf: "Reference genome GTF annotation file"
    sjdb_overhang: "Length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database"
    genome_sa_index_nbases: "Length (bases) of the SA pre-indexing string, typically between 10-15 (scales with genome size)"
    memory_gb: "Memory allocated for the task in GB"
    cpu_cores: "Number of CPU cores allocated for the task"
  }

1  input {
    File reference_fasta
    File reference_gtf
    Int sjdb_overhang = 100
    Int genome_sa_index_nbases = 14
    Int memory_gb = 64
    Int cpu_cores = 8
  }

2  command <<<
    set -eo pipefail
    
    mkdir star_index

    echo "Building STAR index..."
    STAR \
      --runMode genomeGenerate \
      --runThreadN ~{cpu_cores} \
      --genomeDir star_index \
      --genomeFastaFiles "~{reference_fasta}" \
      --sjdbGTFfile "~{reference_gtf}" \
      --sjdbOverhang ~{sjdb_overhang} \
      --genomeSAindexNbases ~{genome_sa_index_nbases}

    tar -czf star_index.tar.gz star_index/
  >>>

3  output {
    File star_index_tar = "star_index.tar.gz"
  }

4  runtime {
    docker: "getwilds/star:2.7.6a"
    memory: "~{memory_gb} GB"
    cpu: cpu_cores
  }
}
1. Inputs for our task.
2. Bash commands to execute in the task.
3. Description of the outputs.
4. Runtime requirements for execution. These are a lot like the #SBATCH directives.

Everything between the <<< and >>> is essentially a bash script. WDL has its own variables, which are interpolated into that script with the ~{variable} syntax (for example, ~{cpu_cores} above).
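
Because the runtime block names a Docker image, you can step into that same image on a compute node and try the task’s commands interactively, which is exactly the container testing described in the learning objectives. A sketch using Apptainer (the module name may differ on your cluster):

module load Apptainer
apptainer shell docker://getwilds/star:2.7.6a

Inside the container shell, a command like STAR --version should behave exactly as it will during workflow execution.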

6.2.7 Architecture of a WDL file

The best way to read WDL files is to read them top down. We’ll focus on the basic sections of a WDL file before we see how they work together.

The code below is from the WILDS WDL repo.

workflow SRA_STAR2Pass {
   meta{
   ...
   }

   parameter_meta{
   ...
   }

1  input {
    ...
  }

2  if (!defined(reference_genome)) {
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

3  call build_star_index {
    ...
  }

  # Outputs that will be retained when execution is complete
4  output {
    ...
  }

} 
1. Inputs for the workflow.
2. If/then logic: download_reference only runs when no reference_genome was supplied.
3. Call a task.
4. Outputs of the workflow.

Let’s go through each of these in detail (we’ll get back to the meta and parameter_meta sections).

The structure of the workflow

workflow SRA_STAR2Pass {
  input {
    Array[SampleInfo] samples
    RefGenome? reference_genome
    String reference_level = ""
    String contrast = ""
  }

  if (!defined(reference_genome)) {
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index { input:
      reference_fasta = ref_genome_final.fasta,
      reference_gtf = ref_genome_final.gtf
  }

  # Outputs that will be retained when execution is complete
  output {
    ...
  }

} 
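
Before handing a long run to Cromwell, it can be handy to check and test a workflow locally with MiniWDL from the table above. A sketch, assuming miniwdl has been installed (for example, with pip install miniwdl):

miniwdl check ww-star2-deseq2.wdl
miniwdl run ww-star2-deseq2.wdl -i input_file.json

miniwdl check validates the syntax and flags common mistakes without executing anything, which makes it a quick first test after you edit a workflow.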

6.3 Where Next?

Now that you understand the basics of working with Bash and WDL, you are ready to start working with WDL workflows.