Chapter 8 Optional types
WDL supports defining optional variables, which are denoted with a question mark next to the type declaration. These variables may or may not defined. Optional types are powerful, but they can quickly cause problems in complicated workflows. This is especially the case when working with scattered tasks. In this section, we’ll go over how optional variables can be used, and where they can cause problems.
8.1 Optional inputs
The most common use case for optional variables are optional inputs for the user to provide some sort of file, for example, a reference genome or a metadata file. Sometimes, you will want optional inputs to fall back on a known file or value.
A good place to use optional types in our workflow would be in the bwa mem
task. bwa mem
on default settings may attempt to use too many threads, which could cause this task to fail if it ran on weaker hardware such as a laptop. In order to allow the user to adjust how many threads bwa mem
uses, and by extension make our WDL run on a wider variety of machines, we can edit the task to add a new variable. This new variable will replace -t 16
in our call to bwa mem
.
task BwaMem {
input {
File input_fastq
referenceGenome refGenome
# Number of threads to use - must be defined
Int threads
}
command <<<
set -eo pipefail
mv "~{refGenome.ref_fasta}" .
mv "~{refGenome.ref_fasta_index}" .
mv "~{refGenome.ref_dict}" .
mv "~{refGenome.ref_amb}" .
mv "~{refGenome.ref_ann}" .
mv "~{refGenome.ref_bwt}" .
mv "~{refGenome.ref_pac}" .
mv "~{refGenome.ref_sa}" .
bwa mem \
-p -v 3 -t ~{threads} -M -R '@RG\t~{read_group_id}\t~{sample_name}\t~{platform_info}' \
"~{ref_fasta_local}" "~{input_fastq}" > "~{base_file_name}.sam"
samtools view -1bS -@ 15 -o "~{base_file_name}.aligned.bam" "~{base_file_name}.sam"
samtools sort -@ 15 -o "~{base_file_name}.sorted_query_aligned.bam" "~{base_file_name}.aligned.bam"
[...]
}
Because ~{threads}
is an integer variable, it cannot have a space in it, so we don’t need to surround ~{threads}
with quotation marks like we should String or File variables.
Now we can set Int? bwa_mem_threads
as a workflow-level optional input. Making it optional means that users who do not want to think about threads – perhaps they’re using an HPC or other powerful backend – do not have to worry about filling in that value. In the workflow body, when we call bwa mem
, the value of bwa_mem_threads
is passed to the BwaMem
task.
workflow mutation_calling {
input {
Array[File] tumorSamples
File normalFastq
referenceGenome refGenome
# Optional variable for bwa mem
Int? bwa_mem_threads
# Files for specific tools
File dbSNP_vcf
File dbSNP_vcf_index
File known_indels_sites_VCFs
File known_indels_sites_indices
File af_only_gnomad
File af_only_gnomad_index
# Annovar options
String annovar_protocols
String annovar_operation
}
call BwaMem as normalBwaMem {
input:
input_fastq = normalFastq,
refGenome = refGenome,
threads = bwa_mem_threads
}
[...]
}
But what happens if the workflow-level variable bwa_mem_threads
isn’t defined? The task BwaMem
will then get no value for the task-level variable threads
. This will cause an error at runtime. We can’t solve this merely by making the task level variable optional too – otherwise, if threads
isn’t defined, our bwa mem
call will provide -t
without a number after it, which is invalid.
One way to address this is with the WDL built-in function select_first()
, which takes in an array of values. In the workflow body, we can use this plus a fallback value (or another variable) to ensure that the BwaMem
task always gets a valid integer for the threads
argument.
call BwaMem as normalBwaMem {
input:
input_fastq = normalFastq,
refGenome = refGenome,
threads = select_first([bwa_mem_threads, 16])
}
This is a perfectly valid way of doing things, but the task itself still relies on that variable being defined. In the context of our workflow, that doesn’t matter. However, keep in mind it is a common practice to reuse code for multiple scripts – whether that is directly importing other WDL tasks using WDL’s importing feature, or simply copy-pasting boilerplate code. With that in mind, it’s a good idea to make tasks as “stand-alone” as possible, and not reliant on the workflow calling them to use select_first()
.
As such, it’s more common to make the task itself handle this by providing a fallback value, like so:
task BwaMem {
input {
File input_fastq
referenceGenome refGenome
Int threads = 16 # if a workflow passes an optional variable with no value, fall back to 16
}
[...]
}
Here’s another way of providing a fallback value for optional inputs within a task.
In the following task, the phylogenetic tree program UShER expects a reference genome, an input phylogenetic tree, samples in the form of diff files, and an output file name. However, the Docker image we’re using (ashedpotatoes/usher-plus:0.0.2
) already contains the H37Rv reference genome, which is the standard reference genome for Mycobacterium tuberculosis. If we are working with tuberculosis samples, there is no reason to provide a reference genome – we might as well use the one that’s already in the Docker image.
task usher_sampled_diff {
input {
File diff
File input_tree # in MAT format, extension should be .pb
String output_file_name
File? ref_genome
}
command <<<
if [[ "~{ref_genome}" = "" ]]
then
# if not defined, fall back to H37Rv in the Docker image
ref="/HOME/usher/ref/Ref.H37Rv/ref.fa"
else
ref="~{ref_genome}"
fi
usher-sampled --diff "~{diff}" \
-i "${input_tree}" \
--ref "$ref" \
-o "~{output_file_name}"
>>>
runtime {
docker: "ashedpotatoes/usher-plus:0.0.2"
}
output {
File usher_tree = output_file_name
}
}
8.2 Optional outputs
Optional outputs are usually created by explicitly declaring a task or workflow’s output variable to be optional. They can also be created by enclosing a task within an if-block, which will cause all outputs of that task to be considered optional.
8.2.1 Declaring a task’s output to be optional
Most backends will save the task-level outputs of a task, even if those outputs are not workflow-level outputs. This can be very useful if you want to be able to check intermediate files, or perhaps log files for certain bioinformatic tools. However, if you’re not actively debugging, those files may take up a lot of space on your backend if you’re running on many samples over time.
You could change the workflow every time you wish to keep those intermediate inputs, but it would be easier to make the output optional.
task usher_sampled_diff {
input {
File diff
File input_tree # in MAT format, extension should be .pb
String output_file_name
File? ref_genome
Boolean detailed_clades
}
# UShER will output clade information if you pass the -D flag to it, but
# we don't want users to have to type in flags as a String variable. Instead,
# we just make a Boolean variable called detailed_clades, and outside the inputs
# section, we write the string "-D" if detailed_clades is true, or an empty string
# if detailed_clades is false.
String detailed_clades_flag = if !(detailed_clades) then "" else "-D "
command <<<
if [[ "~{ref_genome}" = "" ]]
then
ref="/HOME/usher/ref/Ref.H37Rv/ref.fa"
else
ref="~{ref_genome}"
fi
usher-sampled ~{detailed_clades_flag} \
--diff "~{diff}" \
-i "${input_tree}" \
--ref "$ref" \
-o "~{output_file_name}"
>>>
runtime {
docker: "ashedpotatoes/usher-plus:0.0.2"
}
output {
File usher_tree = output_file_name
File? clades = "clades.txt" # only if detailed_clades = true
}
}
If we want, we can make this task-level output a workflow-level output – perhaps we use a backend that does not save task-level outputs, or it is simply easier to access workflow-level outputs on this particular backend. Provided that the task is not scattered, this is very easy to do. (We’ll talk about scattered tasks later, as they complicate matters.) In the workflow output sections, we simply define the output as optional like we did in the task itself.
8.2.2 Making an entire task optional
Let’s say we want to make an entire task optional. Perhaps it provides additional QC information, or isn’t relevant to certain data types.
Although WDL has a concept of if
, it does not have any concept of else
, so to make mutually exclusive tasks, you essentially need use two mutually exclusive if-statements. In the example below, we use two mutually exclusive if-statements to decide whether to run task_x or task_y. This is a valid way of doing things, and will work as expected:
workflow do_x_or_y {
input {
Boolean do_x = true # if true do task_x, if false do task_y
File some_input
}
if (do_x = true) {
call task_x {
input:
some_input = some_input
}
}
if (do_x = false) {
call task_y {
input:
some_input = some_input
}
}
}
However, we need to keep in mind that WDL is not “aware” that these two tasks are mutually exclusive. You and I both know that if task_x has run, then task_y must not have run, but your WDL executor does not know this. So, as a consequence, you cannot use mutually exclusive if-blocks to set a variable to a particular value.
# this workflow will fail miniwdl check or womtool
workflow do_x_or_y {
input {
Boolean do_x = true # if true do task_x, if false do task_y
File some_input
}
if (do_x = true) {
Int some_variable = 10 # this is problematic, even though only one of these if blocks will execute!
call task_x {
input:
some_input = some_input
}
}
if (do_x = false) {
Int some_variable = 5 # this is problematic, even though only one of these if blocks will execute!
call task_y {
input:
some_input = some_input
}
}
}
Another consequence is that both of your mutually exclusive tasks’ outputs will be considered optional types, even though you can be certain that set of outputs absolutely do exist and one set of outputs absolutely do not exist.
workflow do_x_or_y {
input {
Boolean do_x = true # if true do task_x, if false do task_y
File some_input
}
if (do_x = true) {
call task_x {
input:
some_input = some_input
}
}
if (do_x = false) {
call task_y {
input:
some_input = some_input
}
}
outputs {
# these both MUST be optional
File? task_x_output = task_x.something
File? task_y_output = task_y.something
}
}
Checking if a variable is defined
You can check if a variable is defined at runtime using the WDL built-in defined()
, which returns a Boolean value. However, defined()
can give unexpected output, as we will go over later. For the time being, let’s use a simple example where we set a Boolean variable via defined()
. This will work as expected – if task_x ran and created the output task_x.something
, then did_x_run
will be true, and so on.
workflow do_x_or_y {
input {
Boolean do_x = true # if true do task_x, if false do task_y
File some_input
}
if (do_x = true) {
call task_x {
input:
some_input = some_input
}
}
if (do_x = false) {
call task_y {
input:
some_input = some_input
}
}
# these work as you would expect
Boolean did_x_run = defined(task_x.something)
Boolean did_y_run = defined(task_y.something)
outputs {
File? task_x_output = task_x.something
File? task_y_output = task_y.something
Boolean did_x_run = did_x_run
Boolean did_y_run = did_y_run
}
}
Note that although you could decide whether or not to execute a task by wrapping that task in a defined()
if-block, you cannot use defined()
to coerce a variable. It’s generally best to use select_first()
for type coercion, although we will mention some exceptions when it comes to arrays.
# if some_integer has type Int?, we cannot coerce it like this:
if defined(some_integer) {
Int real_some_integer = some_integer
}
# because outside of the if-block, real_some_integer is considered optional
You should not use defined()
with optional arrays, as it often acts in unexpected ways. It is very easy to accidentally create a defined()
check that always returns true, as WDL executors may define “optional” arrays even if they have no values. For example, let’s consider task_x and task_y again. To recap, these are mutually exclusive tasks which each output a single File. If these tasks are scattered from outside their respective if-blocks, the their outputs are no longer File?, instead, they are arrays. But are they arrays of optional File?s, or are the arrays themselves optional? Let’s try to find out:
workflow do_x_or_y {
input {
Boolean do_x = true # if true do task_x, if false do task_y
Array[File] some_inputs
}
scatter(some_input in some_inputs) {
if (do_x = true) {
call task_x {
input:
some_input = some_input
}
}
if (do_x = false) {
call task_y {
input:
some_input = some_input
}
}
}
# these both always return true!
Boolean did_x_run = defined(task_x.something)
Boolean did_y_run = defined(task_y.something)
outputs {
File? task_x_output = task_x.something
File? task_y_output = task_y.something
Boolean did_x_run = did_x_run
Boolean did_y_run = did_y_run
}
}
Instead of using defined()
on an array, consider simply checking its length. An empty array has a length of 0, so if the length of an array is greater than 1, you know that there is at least one item existing in that array.
Working with optional arrays and advanced usage of select_first()
Coercing arrays with optional components
Optional arrays can be difficult to work with. One of the reasons for this is that they sometimes give unexpected values for defined()
. As mentioned earlier, this can be somewhat alleviated by checking the length of an array rather than whether or not it is defined, but the problems do not end there: Attempting to scatter on an optional array can cause issues that are difficult to debug.
It’s generally best to coerce optional arrays into not-optional arrays as much as possible. Thankfully, we can use select_all()
to select only the defined elements of an array. For our example above, we can go even further by creating an array that will always have a defined value, whether or not do_x
is true. We combine the power of select_all()
and flatten()
to create the following:
Array[File] x_or_y_outfiles = flatten(select_all(x.outfile), select_all(y.outfile))
Make sure not to use flatten(select_all(x.outfile, y.outfile))
! That would result in an Array[File] that has a null value, which can cause hard-to-diagnose issues later on in your pipeline.
If you’re absolutely certain your variable, in this particular scope, is defined, you can make the second member of the select_first()
array a random bogus fallback value. You may want to add a comment to prevent you or other programmers accidentally breaking things should they edit the code later.
if defined(some_integer) { # needed to define real_some_integer, beware of changing this!
Int real_some_integer = select_first([some_integer, 1975])
}