6  Workflows

6.1 Learning Objectives

  • Execute a workflow on the Fred Hutch Cluster
  • Modify a workflow and test it on a compute node
  • Utilize a container and test inside it on a compute node

6.2 Running Workflows

So far, we’ve been batch processing files using SLURM job arrays. However, once your processing involves multiple steps, you should consider using a workflow manager.

A workflow manager takes a workflow, which can consist of multiple steps, and lets you batch process files through those steps.

A good workflow manager will allow you to:

  • Restart failed subjobs in the workflow
  • Customize where intermediate and final outputs go
  • Swap and customize modules in your workflow
  • Adapt to different computing architectures (HPC, cloud, etc.)

Many bioinformaticians use workflow managers to process and manage hundreds or thousands of files at a time. They are well worth learning.

Here is an overview of some common bioinformatics workflow managers. We will be using Cromwell, which runs WDL files.

Manager Software   Workflow Formats        Notes
Cromwell           WDL/CWL                 Made for HPC jobs
Sprocket           WDL                     Made for HPC jobs
MiniWDL            WDL                     Used for local testing of workflows
DNAnexus           WDL/CWL                 Used for systems such as All of Us
Nextflow           .nf files               Owned by Seqera
Snakemake          Snakefiles (Make-like)

6.2.1 Grabbing a WDL workflow from GETWILDS

git clone https://github.com/getwilds/ww-star-deseq2/
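
Once it’s cloned, take a quick look at what’s inside (the repository layout may change over time):

cd ww-star-deseq2
ls

Look for the workflow’s .wdl file; this is what we’ll hand to the workflow manager below.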

6.2.2 Executing a WDL workflow

Say someone has given us a WDL file - how do we set it up to run on our own data?

We’ll use cromwell to run our WDL workflow.

module load cromwell/87

We will get the response:

To execute cromwell, run: java -jar $EBROOTCROMWELL/cromwell.jar

To execute womtool, run: java -jar $EBROOTCROMWELL/womtool.jar

We’ll investigate using cromwell run as an initial way to interact with cromwell.

java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
   --inputs input_file.json
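A cromwell run invocation can take hours on a real dataset, so on the cluster it is best submitted as its own SLURM job rather than run on a login node. Here is a minimal sketch of a submission script; the resource requests are placeholders to adjust for your workflow:

#!/bin/bash
#SBATCH --job-name=cromwell-run
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=12:00:00

module load cromwell/87
java -jar $EBROOTCROMWELL/cromwell.jar run ww-star2-deseq2.wdl \
   --inputs input_file.json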

The other way to run Cromwell is to run it in server mode. You can start the server by using:

java -jar $EBROOTCROMWELL/cromwell.jar serve
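
By default the server listens on port 8000, and you can submit workflows to it through Cromwell’s REST API. For example, with curl from the same node:

curl -X POST http://localhost:8000/api/workflows/v1 \
   -F workflowSource=@ww-star2-deseq2.wdl \
   -F workflowInputs=@input_file.json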

While you can run the server on the command line, I don’t recommend it. Instead, check out the PROOF application for a GUI-based way to execute workflows; it handles Cromwell server management automatically.

6.2.3 Input Files as JSON

If you want to work with the JSON format (Section 5.2), there is a trick to generating a template .json file for a workflow: an additional utility called womtool will generate a template inputs file from a workflow file.

java -jar $EBROOTCROMWELL/womtool.jar inputs ww-star2-deseq2.wdl > ww-star2-deseq2-inputs.json

This will generate a file called ww-star2-deseq2-inputs.json that contains all of the inputs:

{
  "star_deseq2.samples": "Array[WomCompositeType {\n name -> String\nr1 -> File\nr2 -> File\ncondition -> String \n}]",
  "star_deseq2.rnaseqc_cov.cpu_cores": "Int (optional, default = 2)",
  ...
}

This can be a good head start to making your .json files.
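
As a purely hypothetical example, a filled-in inputs file for this workflow might look like the following (all sample names and paths are placeholders):

{
  "star_deseq2.samples": [
    {
      "name": "sample1",
      "r1": "/path/to/sample1_R1.fastq.gz",
      "r2": "/path/to/sample1_R2.fastq.gz",
      "condition": "treated"
    }
  ],
  "star_deseq2.rnaseqc_cov.cpu_cores": 2
}

Inputs marked optional in the template can be deleted entirely if you are happy with their defaults. You can also sanity-check the workflow file itself with womtool: java -jar $EBROOTCROMWELL/womtool.jar validate ww-star2-deseq2.wdl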

6.2.4 Working with file manifests in WDL

Last week, we worked with a file manifest to process a list of files.

There is an example workflow that shows how to work with a file manifest. This can be helpful for those who aren’t yet comfortable working with JSON.

This workflow has a single input, which is the location of the file manifest. It will then cycle through the manifest, line by line.

This is what the file manifest contains. Notice there are three named columns for this file: sampleName, bamLocation, and bedLocation.

sampleName bamLocation bedLocation
smallTestData /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/unpaired-panel-consensus-variants-human/smallTestData.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
smallTestData-reference /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/paired-panel-consensus-variants-human/smallTestData-reference.unmapped.bam /fh/fast/paguirigan_a/pub/ReferenceDataSets/reagent_specific_data/sequencing_panel_bed/TruSight-Myeloid-Illumina/trusight-myeloid-amplicon-v4-track_interval-liftOver-hg38.bed
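
One thing to watch out for: read_objects() (used in the workflow below) expects the manifest to be tab-delimited, even though the columns above are displayed with spaces. Assuming your manifest is saved as manifest.tsv, a quick way to check is:

head -n 1 manifest.tsv | cat -A

Tabs show up as ^I in the output; if you see only spaces, the columns won’t parse correctly.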

Here’s the workflow part of the file:

version 1.0
#### WORKFLOW DEFINITION

workflow ParseBatchFile {
  input {
    File batch_file
  }

1  Array[Object] batch_info = read_objects(batch_file)

2  scatter (job in batch_info){
3    String sample_name = job.sampleName
4    File bam_file = job.bamLocation
5    File bed_file = job.bedLocation

    ## INSERT YOUR WORKFLOW TO RUN PER LINE IN YOUR BATCH FILE HERE!!!!
6    call Test {
      input: in1=sample_name, in2=bam_file, in3=bed_file
    }
  }  # End Scatter over the batch file

  # Outputs that will be retained when execution is complete
  output {
    Array[File] output_array = Test.item_out
  }

  parameter_meta {
    batch_file: "input tsv containing details about each sample in the batch"
    output_array: "array containing details about each sample in the batch"
  }
} # End workflow
1. Read in the file manifest line by line, and store it in the array called batch_info.
2. Cycle through the manifest line by line, scattering the work out to multiple nodes.
3. Get the sample_name input from job.sampleName.
4. Get the bam_file input from job.bamLocation.
5. Get the bed_file input from job.bedLocation.
6. Do something with the inputs.
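
To run this workflow, the only input you need to supply is the manifest location. As a sketch, assuming the workflow above is saved as parse-batch.wdl, put the following in parse-batch-inputs.json (the file names and path are placeholders):

{
  "ParseBatchFile.batch_file": "/path/to/manifest.tsv"
}

Then run it the same way as before:

java -jar $EBROOTCROMWELL/cromwell.jar run parse-batch.wdl \
   --inputs parse-batch-inputs.json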

6.2.5 Let’s start with WDL Tasks

We will start at the lowest level of abstraction in WDL: the task.

6.2.6 Anatomy of a Task

task build_star_index {
  meta {
      ...
  }

  parameter_meta {
    reference_fasta: "Reference genome FASTA file"
    reference_gtf: "Reference genome GTF annotation file"
    sjdb_overhang: "Length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database"
    genome_sa_index_nbases: "Length (bases) of the SA pre-indexing string, typically between 10-15 (scales with genome size)"
    memory_gb: "Memory allocated for the task in GB"
    cpu_cores: "Number of CPU cores allocated for the task"
  }

1  input {
    File reference_fasta
    File reference_gtf
    Int sjdb_overhang = 100
    Int genome_sa_index_nbases = 14
    Int memory_gb = 64
    Int cpu_cores = 8
  }

2  command <<<
    set -eo pipefail
    
    mkdir star_index

    echo "Building STAR index..."
    STAR \
      --runMode genomeGenerate \
      --runThreadN ~{cpu_cores} \
      --genomeDir star_index \
      --genomeFastaFiles "~{reference_fasta}" \
      --sjdbGTFfile "~{reference_gtf}" \
      --sjdbOverhang ~{sjdb_overhang} \
      --genomeSAindexNbases ~{genome_sa_index_nbases}

    tar -czf star_index.tar.gz star_index/
  >>>

3  output {
    File star_index_tar = "star_index.tar.gz"
  }

4  runtime {
    docker: "getwilds/star:2.7.6a"
    memory: "~{memory_gb} GB"
    cpu: cpu_cores
  }
}
1. Inputs for our task.
2. Bash commands to execute in the task.
3. Description of the outputs.
4. Runtime requirements for execution. These are a lot like the #SBATCH directives.

Everything between the <<< and >>> is essentially a bash script. WDL has its own variables, which are interpolated into that script with the ~{variable} syntax (for example, ~{cpu_cores} above).
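
Because the runtime block names a Docker image, you can step into that same image on a compute node and try the task’s commands interactively, which is exactly the container testing described in the learning objectives. A sketch using Apptainer (the module name may differ on your cluster):

module load Apptainer
apptainer shell docker://getwilds/star:2.7.6a

Inside the container shell, a command like STAR --version should behave exactly as it will during workflow execution.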

6.2.7 Architecture of a WDL file

The best way to read WDL files is to read them top down. We’ll focus on the basic sections of a WDL file before we see how they work together.

The code below is from the WILDS WDL repo.

workflow SRA_STAR2Pass {
   meta{
   ...
   }

   parameter_meta{
   ...
   }

1  input {
    ...
  }

2  if (!defined(reference_genome)) {
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

3  call build_star_index {
    ...
  }

  # Outputs that will be retained when execution is complete
4  output {
    ...
  }

} 
1. Inputs for the workflow.
2. If/then logic: download_reference only runs when no reference_genome was supplied.
3. Call a task.
4. Outputs of the workflow.

Let’s go through each of these in detail (we’ll get back to the meta and parameter_meta sections).

The structure of the workflow

workflow SRA_STAR2Pass {
  input {
    Array[SampleInfo] samples
    RefGenome? reference_genome
    String reference_level = ""
    String contrast = ""
  }

  if (!defined(reference_genome)) {
    call download_reference {}
  }

  RefGenome ref_genome_final = select_first([reference_genome, download_reference.genome])

  call build_star_index { input:
      reference_fasta = ref_genome_final.fasta,
      reference_gtf = ref_genome_final.gtf
  }

  # Outputs that will be retained when execution is complete
  output {
    ...
  }

} 
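
Before handing a long run to Cromwell, it can be handy to check and test a workflow locally with MiniWDL from the table above. A sketch, assuming miniwdl has been installed (for example, with pip install miniwdl):

miniwdl check ww-star2-deseq2.wdl
miniwdl run ww-star2-deseq2.wdl -i input_file.json

miniwdl check validates the syntax and flags common mistakes without executing anything, which makes it a quick first test after you edit a workflow.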

6.3 Where Next?

Now that you understand the basics of working with Bash and WDL, you are ready to start working with WDL workflows.