Chapter 3 Developing, Testing and Scaling Workflows
3.1 Creating an Inputs template
There is a tool that is installed along with Cromwell called “womtool”. This tool is what does workflow validation through the Shiny app and fh.wdlR package. However, it has additional features that are not available unless you run it directly.
There is specific documentation from the Broad about womtool that can give you more information about the parameters and features of this small tool.
The most useful aspect to womtool is the ability to give it a new workflow you have written and have it give you a template input.json file for you to customize for use. To use it, log into Rhino and do the following:
module load cromwell/84
java -jar $EBROOTCROMWELL/womtool.jar inputs myNewWorkflow.wdl > input_template.json
Now your template input information will be saved in the file input_template.json
.
3.2 Approaches to Testing and Development
3.2.1 Start from a template
Preferably one that executes cleanly - even if it’s Hello World.
3.2.2 Chose Software Modules
Test interactively with software modules on the cluster to see what inputs are required, what parameters you want to specify, what outputs get created.
3.2.3 Add Tasks
Define tasks, using modules, test data (truncated files, scatters of 1), run and test.
3.2.4 Scale Up Inputs
Start to run workflow on full size data (scatters of 1), then start to scatter over several files, then scatter over entire datasets.
3.2.5 Prep for Other Platforms and Sharing
Shift to docker containers instead of modules, ensure that all inputs are specified as files not directories!!, start to optimize compute resources required for tasks (how much memory does it really need, or is it going to need many CPU’s). - start small - start with truncated/downsampled data in the same formats but smaller! - add tasks one at a time and leverage call caching to test single additions quickly - start testing scatters with a scatter of one!
3.2.6 Testing Input Availability
Validating a workflow only checks the formatting of the files but does not attempt to stage the inputs. To do that you might consider tricking Cromwell into localizing all external inputs first by creating a dummy task that runs first before any of your steps in your workflow.
The inputs to this task need to be all externally obtained file inputs to any task in your workflow (not inputs to tasks that come from other WDL tasks!). Then, upon running this workflow, Cromwell will try to localize all the inputs for this first task, before running any future tasks.
Then you have two options: - my inputs are small or local: just remove the input localizing task before re-running the workflow - my inputs are large or expensive to localize: specify the inputs to your workflow tasks as the outputs of this input localizing task by adding them as outputs.
3.2.7 Scaling Up, Moving to the Cloud
- Test with a scatter of many small files (not full size!)
- Begin optimizing computing resources for each task
- Dockerize single task environments
- Test locally on small data scatters of 1 to shift from modules to Dockerize
- Catch errors and adjust such as tools that rely on filesystems/directory structures which will break in the cloud
- Learn about the cloud you’ll be using
- what instance types exist and which are the best for you tasks/data
- how do the data need to be stored in an object store?
- how can you get permissions set up to access both the data and the compute resources?