Chapter 4 Connecting multiple tasks together in a linear chain
Now that you have a first task in a workflow up and running, the next step is to continue building out the workflow as described in our Workflow Plan:
BwaMem
aligns the samples to the reference genome.MarkDuplicates
marks PCR duplicates.ApplyBaseRecalibrator
applies base recalibration.Mutect2
performs somatic mutation calling. For this current iteration, we start withMutect2TumorOnly
which only uses the tumor sample. Later on, we will switch to a version that will allow for comparing tumor samples against a non-tumor normal sample.annovar
annotates the called somatic mutations.
We do this via a linear chain, in which we feed the output of one task to the input of the next task. Let’s see how to build a linear chain in a workflow.
4.1 How to connect tasks together in a workflow
We can easily connect tasks together because of the following: The output variables of a task can be accessed at the workflow level as inputs for the subsequent task.
For instance, let’s see the output of our BwaMem
task.
output {
File analysisReadyBam = "~{base_file_name}.aligned.bam"
File analysisReadySorted = "~{base_file_name}.sorted_query_aligned.bam"
}
The File variables analysisReadyBam
and analysisReadySorted
can now be accessed anywhere in the workflow block after the BwaMem
task as BwaMem.analysisReadyBam
and BwaMem.analysisReadySorted
, respectively. Therefore, when we call the MarkDuplicates
task, we can pass it the input BwaMem.analysisReadySorted
from the BwaMem
task:
workflow mutation_calling {
input {
File sampleFastq
# Reference genome
File ref_fasta
File ref_fasta_index
File ref_dict
File ref_amb
File ref_ann
File ref_bwt
File ref_pac
File ref_sa
}
# Map reads to reference
call BwaMem {
input:
input_fastq = sampleFastq,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
ref_dict = ref_dict,
ref_amb = ref_amb,
ref_ann = ref_ann,
ref_bwt = ref_bwt,
ref_pac = ref_pac,
ref_sa = ref_sa
}
call MarkDuplicates {
input:
input_bam = BwaMem.analysisReadySorted
}
}
Resources:
- For a basic introduction to linear chain, see OpenWDL Docs’ introduction.
- To see other examples of linear chain and variations, see OpenWDL Docs’s section on workflow plumbing.
4.2 Writing MarkDuplicates
task
Of course, the task MarkDuplicates
hasn’t been written yet! Let’s go through it together:
4.2.1 Input
The task takes an input bam file that has been aligned to the reference genome. It needs to be a File input_bam
based on how we introduced it in the workflow above. That is easy to write up:
task MarkDuplicates {
input {
File input_bam
}
}
4.2.2 Private variables in the task
Similar to the BwaMem
task, we will name our output files based on the base name of the original input file in the workflow. Therefore, it makes sense to create a private String variable base_file_name
that contains this base name of input_bam
. We will use base_file_name
in the Command section to specify the output bam and metrics files.
task MarkDuplicates {
input {
File input_bam
}
String base_file_name = basename(input_bam, ".sorted_query_aligned.bam")
}
4.2.3 Command
As we have been doing all along, we refer to any defined variables from the input or private variables using ~{this} syntax.
task MarkDuplicates {
input {
File input_bam
}
String base_file_name = basename(input_bam, ".sorted_query_aligned.bam")
command <<<
gatk MarkDuplicates \
--INPUT "~{input_bam}" \
--OUTPUT "~{base_file_name}.duplicates_marked.bam" \
--METRICS_FILE "~{base_file_name}.duplicate_metrics" \
--CREATE_INDEX true \
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 \
--VALIDATION_STRINGENCY SILENT
>>>
}
4.2.4 Runtime and Output
We specify a different Docker image that contains the GATK software, and the relevant computing needs. We also specify three different output files, two of which are specified in the command section, and the third is a bam index file that is automatically created by the command section.
Below is the task all together. It has a form very similar to our first task BwaMem
.
task MarkDuplicates {
input {
File input_bam
}
String base_file_name = basename(input_bam, ".sorted_query_aligned.bam")
command <<<
gatk MarkDuplicates \
--INPUT "~{input_bam}" \
--OUTPUT "~{base_file_name}.duplicates_marked.bam" \
--METRICS_FILE "~{base_file_name}.duplicate_metrics" \
--CREATE_INDEX true \
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 \
--VALIDATION_STRINGENCY SILENT
>>>
runtime {
docker: "broadinstitute/gatk:4.1.4.0"
memory: "48 GB"
cpu: 4
}
output {
File markDuplicates_bam = "~{base_file_name}.duplicates_marked.bam"
File markDuplicates_bai = "~{base_file_name}.duplicates_marked.bai"
File duplicate_metrics = "~{base_file_name}.duplicate_metrics"
}
}
4.2.5 Testing the workflow
As before, when you add a new task to the workflow, you should always test that it works on your test sample. To check that MarkDuplicates
is indeed marking PCR duplicates, you could check for the presence of the PCR duplicate flag in reads, which has a decimal value of 1024 in the SAM Flags Field.
4.3 The rest of the linear chain workflow
We build out the rest of the tasks in a very similar fashion. Tasks ApplyBaseRecalibrator
and Mutect2TumorOnly
both have files that need to be localized, but otherwise all the tasks have a very similar form as BwaMem
and MarkDuplicates
. For this current iteration, we use only the tumor sample for mutation calling. In the following chapters, we will use additional WDL features to make use of tumor and normal samples for mutation calling.
We also expand our input JSON metadata to have files needed for each task:
{
"mutation_calling.sampleFastq": "/path/to/Tumor_2_EGFR_HCC4006_combined.fastq",
"mutation_calling.ref_fasta": "/path/to/Homo_sapiens_assembly19.fasta",
"mutation_calling.ref_fasta_index": "/path/to/Homo_sapiens_assembly19.fasta.fai",
"mutation_calling.ref_dict": "/path/to/Homo_sapiens_assembly19.dict",
"mutation_calling.ref_pac": "/path/to/Homo_sapiens_assembly19.fasta.pac",
"mutation_calling.ref_sa": "/path/to/Homo_sapiens_assembly19.fasta.sa",
"mutation_calling.ref_amb": "/path/to/Homo_sapiens_assembly19.fasta.amb",
"mutation_calling.ref_ann": "/path/to/Homo_sapiens_assembly19.fasta.ann",
"mutation_calling.ref_bwt": "/path/to/Homo_sapiens_assembly19.fasta.bwt",
"mutation_calling.ref_name": "hg19",
"mutation_calling.dbSNP_vcf_index": "/path/to/dbsnp_138.b37.vcf.gz.tbi",
"mutation_calling.dbSNP_vcf": "/path/to/dbsnp_138.b37.vcf.gz",
"mutation_calling.known_indels_sites_indices": "/path/to/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx",
"mutation_calling.known_indels_sites_VCFs": "/path/to/Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
"mutation_calling.af_only_gnomad": "/path/to/af-only-gnomad.raw.sites.b37.vcf.gz",
"mutation_calling.af_only_gnomad_index": "/path/to/af-only-gnomad.raw.sites.b37.vcf.gz.tbi",
"mutation_calling.annovar_protocols": "refGene,knownGene,cosmic70,esp6500siv2_all,clinvar_20180603,gnomad211_exome",
"mutation_calling.annovar_operation": "g,f,f,f,f,f"
}
The JSON using the Fred Hutch HPC
{
"mutation_calling.sampleFastq": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/wdl_101/HCC4006_final.fastq",
"mutation_calling.ref_fasta": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta",
"mutation_calling.ref_fasta_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.fai",
"mutation_calling.ref_dict": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.dict",
"mutation_calling.ref_pac": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.pac",
"mutation_calling.ref_sa": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.sa",
"mutation_calling.ref_amb": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.amb",
"mutation_calling.ref_ann": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.ann",
"mutation_calling.ref_bwt": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.bwt",
"mutation_calling.ref_name": "hg19",
"mutation_calling.dbSNP_vcf_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/dbsnp_138.b37.vcf.gz.tbi",
"mutation_calling.dbSNP_vcf": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/dbsnp_138.b37.vcf.gz",
"mutation_calling.known_indels_sites_indices": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx",
"mutation_calling.known_indels_sites_VCFs": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
"mutation_calling.af_only_gnomad": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/af-only-gnomad.raw.sites.b37.vcf.gz",
"mutation_calling.af_only_gnomad_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/af-only-gnomad.raw.sites.b37.vcf.gz.tbi",
"mutation_calling.annovar_protocols": "refGene,knownGene,cosmic70,esp6500siv2_all,clinvar_20180603,gnomad211_exome",
"mutation_calling.annovar_operation": "g,f,f,f,f,f"
}