- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Dataflow-style workflows are constructed by connecting data between steps
  - Inputs and outputs of such workflows need to be statically defined
Let us write a workflow with numerically indexed steps. In the following cell, we use
- the ! magic to execute two shell commands that create an input file and remove any output that might already exist.
- the %run magic with option -d test.dot, which records the DAG (directed acyclic graph) of the workflow in a graphviz dot file. Multiple DAGs are saved, reflecting the changing status of each step.
- the %preview magic, which converts the test.dot file into an animation that shows the DAG at different stages of the execution of the workflow.
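To give a sense of what a graphviz dot file contains, here is a minimal Python sketch that writes dot text for a linear chain of four numerically indexed steps. It is an illustration only: the helper `to_dot` is hypothetical, and the node names and attributes that SoS actually records in test.dot are not shown in this tutorial and may differ.

```python
def to_dot(edges, name="workflow"):
    """Return graphviz dot text for a directed graph given as (source, target) pairs."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        # One directed edge per line, e.g. "10" -> "20";
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Assumed edges: each step depends on the step before it.
edges = [("10", "20"), ("20", "30"), ("30", "40")]
dot_text = to_dot(edges)
print(dot_text)
```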
We then define a workflow with four steps:
- 10: run fastqc to check the quality of the input fastq file. For simplicity, we use the touch() action to generate all the output files.
- 20: align the input reads in the fastq files to generate a bam file.
- 30: index the bam file into a bai file.
- 40: call variants from the bam and bai files and generate a vcf file.
We define output for each step, but in reality you can use {sample}.ext in your scripts directly without using any output statement.
As shown by the DAG, the workflow executes each step sequentially. Technically speaking, because no inputs are defined for steps 20, 30 and 40, they are assumed to depend on their previous steps.
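The rule just described can be sketched in plain Python. This is not SoS code and not how SoS is implemented. It only models the stated behavior: a numerically indexed step with no input defined depends on the step before it. The function `implicit_dependencies` is hypothetical, and the file names other than data.html and data.vcf are assumptions.

```python
def implicit_dependencies(steps):
    """Map each step index to the steps it implicitly depends on.

    `steps` maps a numeric index to a dict that may or may not have an "input" key.
    """
    ordered = sorted(steps)
    deps = {}
    for prev, cur in zip([None] + ordered[:-1], ordered):
        if prev is not None and "input" not in steps[cur]:
            # No input defined: assume dependence on the previous step.
            deps[cur] = [prev]
        else:
            deps[cur] = []
    return deps


# Assumed file names for illustration.
steps = {
    10: {"input": "data.fastq", "output": "data.html"},
    20: {"output": "data.bam"},
    30: {"output": "data.bam.bai"},
    40: {"output": "data.vcf"},
}
deps = implicit_dependencies(steps)
print(deps)  # {10: [], 20: [10], 30: [20], 40: [30]}
```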
Dataflow-based workflows
Dataflow-based workflows are constructed from the flow of data. Namely, steps in dataflow-based workflows define input and output files, and the workflow engine connects steps when certain data is needed.
The previous workflow works but the steps can only be executed sequentially. By defining the input and output of each step explicitly, the workflow can be written in a dataflow style as follows:
Because the input and output of each step are clearly defined, SoS knows that the align and bam2bai steps have to be called before step call, and that the default workflow can only be completed with the generation of data.vcf and data.html.
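To illustrate how an engine could connect steps from their declared inputs and outputs, here is a small Python sketch that starts from the requested targets and walks backwards to the steps that produce them. It is a simplified model under stated assumptions, not the SoS implementation: the step name fastqc, the function `resolve`, and the file names data.fastq, data.bam and data.bam.bai are hypothetical, and the sketch does not handle cyclic dependencies.

```python
# Assumed step definitions: each step lists the files it reads and writes.
steps = {
    "fastqc": {"input": ["data.fastq"], "output": ["data.html"]},
    "align": {"input": ["data.fastq"], "output": ["data.bam"]},
    "bam2bai": {"input": ["data.bam"], "output": ["data.bam.bai"]},
    "call": {"input": ["data.bam", "data.bam.bai"], "output": ["data.vcf"]},
}


def resolve(targets, steps):
    """Return the steps needed to produce `targets`, in a valid execution order."""
    producer = {f: name for name, s in steps.items() for f in s["output"]}
    order = []

    def visit(target):
        name = producer.get(target)
        if name is None or name in order:
            # Either no step produces this file (it is a plain input file),
            # or the producing step is already scheduled.
            return
        for dep in steps[name]["input"]:
            visit(dep)
        order.append(name)

    for t in targets:
        visit(t)
    return order


order = resolve(["data.vcf", "data.html"], steps)
print(order)  # ['align', 'bam2bai', 'call', 'fastqc']
```

In this model, align and bam2bai are scheduled before call only because call lists their outputs as its inputs, which mirrors the reasoning above.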
The numeric indexing of steps provides a default order of execution that can be overridden if inputs and outputs are defined. For example, in the following example, the steps are actually executed in the order of 30 and 40 (concurrently), 20, and then 10. A default step is not added because all steps in numerically indexed workflows will be executed, although the numeric order is not preserved in this particular case.
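The effect of inputs and outputs overriding the numeric order can be sketched with Python's standard graphlib module. The inputs and outputs below are hypothetical, chosen only so that they reproduce the order described above. They are not the ones used in the actual example, and this is not how SoS schedules steps internally.

```python
from graphlib import TopologicalSorter

# Hypothetical inputs and outputs for four numerically indexed steps.
steps = {
    10: {"input": ["c.txt"], "output": ["d.txt"]},
    20: {"input": ["a.txt", "b.txt"], "output": ["c.txt"]},
    30: {"input": [], "output": ["a.txt"]},
    40: {"input": [], "output": ["b.txt"]},
}

# Map each file to the step that produces it, then derive step dependencies.
producer = {f: idx for idx, s in steps.items() for f in s["output"]}
deps = {
    idx: {producer[f] for f in s["input"] if f in producer}
    for idx, s in steps.items()
}

# Collect batches of steps that are ready at the same time.
sorter = TopologicalSorter(deps)
sorter.prepare()
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)  # [[30, 40], [20], [10]]
```

Steps in the same batch have no dependency on each other, so they could run concurrently.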
When your workflow becomes more complex, you can define data by their names as follows: