- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Data flow style workflows construct workflows by connecting data between steps
  - Inputs and outputs of such workflows need to be statically defined
Let us write a workflow with numerically indexed steps. In the following cell, we use

- the `!` magic to execute two shell commands that create an input file and remove any output that might already exist;
- the `%run` magic with option `-d test.dot`, which records the DAG (directed acyclic graph) of the workflow in a Graphviz dot file. Multiple DAGs are saved, reflecting the changing status of each step;
- the `%preview` magic, which converts the `test.dot` file into an animation that shows the DAG at different stages of the execution of the workflow.
We then define a workflow with four steps:

- `10`: run `fastqc` to check the quality of the input `fastq` file. For simplicity, we use the `touch()` action to generate all the output files.
- `20`: align the input reads in the `fastq` file to generate a `bam` file.
- `30`: index the `bam` file into a `bai` file.
- `40`: call variants from the `bam` and `bai` files and generate a `vcf` file.
We define the output of each step explicitly, but in practice you can use {sample}.ext directly in your scripts without any output statement.
As shown by the DAG, the workflow executes the steps sequentially. Technically speaking, because no inputs are defined for steps 20, 30 and 40, each of them is assumed to depend on the step before it.
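The implicit ordering rule described above can be sketched in plain Python. This is only an illustration of the rule, not SoS syntax and not how SoS is implemented; the helper name `implicit_dependencies` and its parameters are hypothetical.

```python
def implicit_dependencies(step_indices, steps_with_input=()):
    """Map each numeric step to the step it implicitly depends on.

    Hypothetical helper: a step that declares no input is assumed to
    depend on the step with the next lower index. Steps that declare
    their own input (and the first step) get no implicit predecessor.
    """
    ordered = sorted(step_indices)
    explicit = set(steps_with_input)
    deps = {}
    for position, step in enumerate(ordered):
        if position == 0 or step in explicit:
            deps[step] = None  # no implicit predecessor
        else:
            deps[step] = ordered[position - 1]
    return deps


# Step 10 declares its input; steps 20, 30 and 40 do not.
chain = implicit_dependencies([10, 20, 30, 40], steps_with_input=[10])
print(chain)  # {10: None, 20: 10, 30: 20, 40: 30}
```

The resulting chain 10 → 20 → 30 → 40 is the strictly sequential DAG discussed above.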
Dataflow-based workflows
Dataflow-based workflows are constructed from the flow of data: each step defines its input and output files, and the workflow engine connects the steps whenever one step needs data that another step produces.
The previous workflow works but the steps can only be executed sequentially. By defining the input and output of each step explicitly, the workflow can be written in a dataflow style as follows:
Because the input and output of each step are clearly defined, SoS knows that the align and bam2bai steps have to be executed before step call, and that the default workflow completes only once data.vcf and data.html have been generated.
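To make the idea concrete, here is a minimal Python sketch of how an engine could connect steps by matching declared outputs to declared inputs. It is not SoS code and does not reflect SoS internals. The step names `align`, `bam2bai` and `call` and the targets `data.vcf` and `data.html` come from the text; the step name `fastqc`, the file names `data.fastq`, `data.bam` and `data.bam.bai`, and the helpers `build_dag` and `steps_for` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step table: each step statically declares input and output.
STEPS = {
    "fastqc": {"input": ["data.fastq"], "output": ["data.html"]},
    "align": {"input": ["data.fastq"], "output": ["data.bam"]},
    "bam2bai": {"input": ["data.bam"], "output": ["data.bam.bai"]},
    "call": {"input": ["data.bam", "data.bam.bai"], "output": ["data.vcf"]},
}


def build_dag(steps):
    """Return {step: set of steps that produce one of its inputs}."""
    producer = {f: name for name, spec in steps.items() for f in spec["output"]}
    return {
        name: {producer[f] for f in spec["input"] if f in producer}
        for name, spec in steps.items()
    }


def steps_for(targets, steps):
    """List the steps needed to produce `targets`, dependencies first."""
    producer = {f: name for name, spec in steps.items() for f in spec["output"]}
    dag = build_dag(steps)
    needed, stack = set(), [producer[t] for t in targets]
    while stack:  # walk backwards from the targets to their producers
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(dag[step])
    return [s for s in TopologicalSorter(dag).static_order() if s in needed]


vcf_only = steps_for(["data.vcf"], STEPS)
both = steps_for(["data.vcf", "data.html"], STEPS)
print(vcf_only)  # ['align', 'bam2bai', 'call']
```

Requesting only `data.vcf` leaves out the quality-control step, whereas requesting both targets pulls in all four steps. The order is derived entirely from the files, not from the position of the steps in the script.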
The numeric indexes of steps provide a default order of execution that can be overridden if inputs and outputs are defined. For instance, in the following example, the steps are actually executed in the order of 30 and 40 (concurrently), then 20, and then 10. A default step is not added because all steps in a numerically indexed workflow will be executed, although the numeric order is not preserved in this particular case.
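The following Python sketch shows how declared inputs and outputs can override numeric order and expose steps that may run concurrently. It is an illustration only, not SoS syntax; the file names `a.txt` through `d.txt` and the helper `execution_batches` are hypothetical and chosen so that the resulting order matches the one described above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical numerically indexed steps with made-up file names.
NUMBERED_STEPS = {
    10: {"input": ["c.txt"], "output": ["d.txt"]},
    20: {"input": ["a.txt", "b.txt"], "output": ["c.txt"]},
    30: {"input": [], "output": ["a.txt"]},
    40: {"input": [], "output": ["b.txt"]},
}


def execution_batches(steps):
    """Group steps into batches; steps in one batch are independent."""
    producer = {f: idx for idx, spec in steps.items() for f in spec["output"]}
    sorter = TopologicalSorter({
        idx: {producer[f] for f in spec["input"] if f in producer}
        for idx, spec in steps.items()
    })
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose inputs exist
        batches.append(ready)
        sorter.done(*ready)  # mark the batch as finished
    return batches


batches = execution_batches(NUMBERED_STEPS)
print(batches)  # [[30, 40], [20], [10]]
```

Every step still runs, but the data dependencies, not the step numbers, decide when.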
When your workflow becomes more complex, you can define data by their names as follows: