Chapter 5 Organizing variables via Structs
In our workflow so far, we see that certain variables are always used together, even for different tasks. For example, variables related to the reference genome are always used for the same purpose and passed on to tasks in almost the same way. This leads to quite a bit of coding redundancy, as when we write down the large set of variables related to the reference genome as task inputs, we are just thinking about one entity. We don’t make distinctions of the reference genome files until the task body itself.
To improve code organization and readability, we can package all variables related to the reference genome into a compound data structure called a struct. With a struct variable, we can refer all the packaged variables as one single variable, and also refer to specific variables within the struct without losing any information. OpenWDL Docs also has an excellent introduction and examples on structs.
To define a struct, we must declare it outside of a workflow
and task
:
struct referenceGenome {
File ref_fasta
File ref_fasta_index
File ref_dict
File ref_amb
File ref_ann
File ref_bwt
File ref_pac
File ref_sa
String ref_name
}
workflow mutation_calling {
input {
File sampleFastq
referenceGenome refGenome ## our struct
...
}
# Map reads to reference
call BwaMem {
input:
input_fastq = sampleFastq,
refGenome = refGenome ## our struct
}
}
The referenceGenome
struct contains all the variables related to the reference genome, but values cannot be defined here. The struct definition merely lays the skeleton components of the data structure, but contains no actual values.
In our workflow inputs, we remove all of the File
variables associated with reference genome definitions, but keep anything that isn’t related to the reference genome, such as sampleFastq
. We instead declare a referenceGenome
struct variable called refGenome
via referenceGenome refGenome
. We can access the variables within a struct by the following syntax: structVar.varName
, such as refGenome.ref_name
. The WDL spec has more information on how to define and use structs.
To give values to refGenome
, we need to modify our JSON metadata file. We define the refGenome
variable in a nested structure that corresponds to the referenceGenome
struct. Let’s take a look:
{
"mutation_calling.refGenome": {
"ref_fasta": "/path/to/Homo_sapiens_assembly19.fasta",
"ref_fasta_index": "/path/to/Homo_sapiens_assembly19.fasta.fai",
"ref_dict": "/path/to/Homo_sapiens_assembly19.dict",
"ref_pac": "/path/to/Homo_sapiens_assembly19.fasta.pac",
"ref_sa": "/path/to/Homo_sapiens_assembly19.fasta.sa",
"ref_amb": "/path/to/Homo_sapiens_assembly19.fasta.amb",
"ref_ann": "/path/to/Homo_sapiens_assembly19.fasta.ann",
"ref_bwt": "/path/to/Homo_sapiens_assembly19.fasta.bwt",
"ref_name": "hg19"
},
"mutation_calling.dbSNP_vcf_index": "/path/to/dbsnp_138.b37.vcf.gz.tbi",
...
}
Now refGenome
has all the values it needs for our tasks.
In addition, we have replaced all the reference genome inputs in call BwaMem
with refGenome
in order to pass information to a task via structs.
Within the BwaMem
task, we must refer to variables inside the struct, such as refGenome.ref_name
(which has a value of “hg19” using this JSON metadata):
# Align fastq file to the reference genome
task BwaMem {
input {
File input_fastq
referenceGenome refGenome
}
String base_file_name = basename(input_fastq, ".fastq")
String ref_fasta_local = basename(refGenome.ref_fasta)
String read_group_id = "ID:" + base_file_name
String sample_name = "SM:" + base_file_name
String platform_info = "PL:illumina"
command <<<
set -eo pipefail
mv "~{refGenome.ref_fasta}" .
mv "~{refGenome.ref_fasta_index}" .
mv "~{refGenome.ref_dict}" .
mv "~{refGenome.ref_amb}" .
mv "~{refGenome.ref_ann}" .
mv "~{refGenome.ref_bwt}" .
mv "~{refGenome.ref_pac}" .
mv "~{refGenome.ref_sa}" .
bwa mem \
-p -v 3 -t 16 -M -R '@RG\t~{read_group_id}\t~{sample_name}\t~{platform_info}' \
"~{ref_fasta_local}" "~{input_fastq}" > "~{base_file_name}.sam"
samtools view -1bS -@ 15 -o "~{base_file_name}.aligned.bam" "~{base_file_name}.sam"
samtools sort -@ 15 -o "~{base_file_name}.sorted_query_aligned.bam" "~{base_file_name}.aligned.bam"
>>>
output {
File analysisReadySorted = "~{base_file_name}.sorted_query_aligned.bam"
}
runtime {
memory: "48 GB"
cpu: 16
docker: "ghcr.io/getwilds/bwa:0.7.17"
}
}
Other tasks in the workflow, such as ApplyBaseRecalibrator
and Mutect2TumorOnly
also make use of the reference genome, so we pass refGenome
to it. The final task annovar
only requires the reference genome name, and none of the files in the referenceGenome
struct. We make a stylistic choice to pass only refGenome.ref_name
to the input of annovar
task call, as the task doesn’t need the full information of the struct. This stylistic choice is based on the principle of passing on the minimally needed information for a modular piece of code to run, which makes the task annovar
depend on the minimal amount of inputs. This will also save us time and disk space by not having to localize several gigabytes of reference files into the Docker container that annovar
will be running in.
call annovar {
input:
input_vcf = Mutect2TumorOnly.output_vcf,
ref_name = refGenome.ref_name,
annovar_operation = annovar_operation,
annovar_protocols = annovar_protocols
}
Putting everything together in the workflow:
The JSON metadata:
{
"mutation_calling.sampleFastq": "/path/to/Tumor_2_EGFR_HCC4006_combined.fastq",
"mutation_calling.refGenome": {
"ref_fasta": "/path/to/Homo_sapiens_assembly19.fasta",
"ref_fasta_index": "/path/to/Homo_sapiens_assembly19.fasta.fai",
"ref_dict": "/path/to/Homo_sapiens_assembly19.dict",
"ref_pac": "/path/to/Homo_sapiens_assembly19.fasta.pac",
"ref_sa": "/path/to/Homo_sapiens_assembly19.fasta.sa",
"ref_amb": "/path/to/Homo_sapiens_assembly19.fasta.amb",
"ref_ann": "/path/to/Homo_sapiens_assembly19.fasta.ann",
"ref_bwt": "/path/to/Homo_sapiens_assembly19.fasta.bwt",
"ref_name": "hg19"
},
"mutation_calling.dbSNP_vcf_index": "/path/to/dbsnp_138.b37.vcf.gz.tbi",
"mutation_calling.dbSNP_vcf": "/path/to/dbsnp_138.b37.vcf.gz",
"mutation_calling.known_indels_sites_indices": "/path/to/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx",
"mutation_calling.known_indels_sites_VCFs": "/path/to/Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
"mutation_calling.af_only_gnomad": "/path/to/af-only-gnomad.raw.sites.b37.vcf.gz",
"mutation_calling.af_only_gnomad_index": "/path/to/af-only-gnomad.raw.sites.b37.vcf.gz.tbi",
"mutation_calling.annovar_protocols": "refGene,knownGene,cosmic70,esp6500siv2_all,clinvar_20180603,gnomad211_exome",
"mutation_calling.annovar_operation": "g,f,f,f,f,f"
}
The JSON using the Fred Hutch HPC
{
"mutation_calling.sampleFastq": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/wdl_101/HCC4006_final.fastq",
"mutation_calling.refGenome": {
"ref_fasta": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta",
"ref_fasta_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.fai",
"ref_dict": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.dict",
"ref_pac": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.pac",
"ref_sa": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.sa",
"ref_amb": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.amb",
"ref_ann": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.ann",
"ref_bwt": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.bwt",
"ref_name": "hg19"
},
"mutation_calling.dbSNP_vcf_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/dbsnp_138.b37.vcf.gz.tbi",
"mutation_calling.dbSNP_vcf": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/dbsnp_138.b37.vcf.gz",
"mutation_calling.known_indels_sites_indices": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx",
"mutation_calling.known_indels_sites_VCFs": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
"mutation_calling.af_only_gnomad": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/af-only-gnomad.raw.sites.b37.vcf.gz",
"mutation_calling.af_only_gnomad_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/af-only-gnomad.raw.sites.b37.vcf.gz.tbi",
"mutation_calling.annovar_protocols": "refGene,knownGene,cosmic70,esp6500siv2_all,clinvar_20180603,gnomad211_exome",
"mutation_calling.annovar_operation": "g,f,f,f,f,f"
}