
Data-flow style workflows

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • Data-flow style workflows are constructed by connecting data between steps
    • Inputs and outputs of such workflows need to be statically defined

Workflows with numerically indexed steps

Let us write a workflow with numerically indexed steps. In the following cell, we use

  • ! magic to execute two shell commands to create an input file and remove any output that might have existed.
  • %run magic with option -d test.dot, which records the DAG (directed acyclic graph) of the workflow into a graphviz dot file. Multiple DAGs are saved, reflecting the changing status of each step.
  • %preview magic that converts the test.dot file into an animation showing the DAG at different stages of the execution of the workflow.

We then define a workflow with four steps:

  • 10: run fastqc to check the quality of the input fastq file. For simplicity, we use the touch() action to generate all the output files.
  • 20: align the input reads in the fastq file to generate a bam file.
  • 30: index the bam file to generate a bai file.
  • 40: call variants from the bam and bai files to generate a vcf file.

We define an output for each step, but in practice you can use {sample}.ext directly in your scripts without any output statement.

In [1]:

As shown by the DAG, the workflow executes each step sequentially. Technically speaking, because no inputs are defined for steps 20, 30 and 40, they are assumed to be dependent on their previous steps.
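The implicit rule described above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not SoS code or SoS's implementation: the function name implicit_dependencies and its parameters are hypothetical, and the only assumption carried over from the text is that a numerically indexed step without an input depends on the step before it.

```python
def implicit_dependencies(step_indexes, steps_with_input=()):
    """Map each step that declares no input to the step preceding it.

    Hypothetical helper for illustration; it only models the rule that
    a numerically indexed step without an input waits for its predecessor.
    """
    ordered = sorted(step_indexes)
    has_input = set(steps_with_input)
    deps = {}
    # Pair every step with the one that comes right before it.
    for previous, current in zip(ordered, ordered[1:]):
        if current not in has_input:
            deps[current] = previous
    return deps


chain = implicit_dependencies([10, 20, 30, 40])
print(chain)  # {20: 10, 30: 20, 40: 30}
```

Under this model, steps 20, 30 and 40 form a single chain behind step 10, which matches the sequential DAG described above.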

Dataflow-based workflow

The previous workflow works but the steps can only be executed sequentially. By defining the input and output of each step explicitly, the workflow can be written in a dataflow style as follows:

In [2]:

Because the inputs and outputs of each step are clearly defined, SoS knows that the align and bam2bai steps have to be called before step call, and that the default workflow can only be completed with the generation of data.vcf and data.html.
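To make the dependency reasoning concrete, here is a minimal sketch in plain Python (not SoS syntax) of how an execution order can be derived from declared inputs and outputs. It uses the standard-library graphlib module and is not a description of how SoS resolves its DAG internally. The step name fastqc and every file name other than data.vcf and data.html are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step table. Only data.vcf and data.html come from the text;
# the remaining file names and the step name "fastqc" are assumed.
STEPS = {
    "fastqc": {"input": ["data.fastq"], "output": ["data.html"]},
    "align": {"input": ["data.fastq"], "output": ["data.bam"]},
    "bam2bai": {"input": ["data.bam"], "output": ["data.bam.bai"]},
    "call": {"input": ["data.bam", "data.bam.bai"], "output": ["data.vcf"]},
    "default": {"input": ["data.vcf", "data.html"], "output": []},
}


def build_dag(steps):
    """Return {step: set of steps whose outputs it consumes}."""
    # Record which step produces each file.
    producer = {f: name for name, spec in steps.items() for f in spec["output"]}
    # A step depends on the producer of each input; inputs that no step
    # produces (such as data.fastq) are treated as pre-existing files.
    return {
        name: {producer[f] for f in spec["input"] if f in producer}
        for name, spec in steps.items()
    }


dag = build_dag(STEPS)
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Any order printed by this sketch places align before bam2bai, both before call, and default last, because default needs both data.vcf and data.html.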

The numeric numbering of steps provides a default order of execution that can be overridden if inputs and outputs are defined. For example, in the following workflow, the steps are actually executed in the order of 30 and 40 (concurrently), then 20, and then 10. A default step is not added because all steps in numerically indexed workflows will be executed, although their numeric order is not preserved in this particular case.
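The following plain-Python sketch illustrates how data dependencies can override numeric order and expose steps that may run concurrently. It is a conceptual model built on the standard-library graphlib module, not SoS's scheduler, and all file names in the step table are hypothetical, chosen so that steps 30 and 40 feed step 20, which in turn feeds step 10.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inputs and outputs for four numerically indexed steps.
STEPS = {
    10: {"input": ["c.txt"], "output": ["d.txt"]},
    20: {"input": ["a.txt", "b.txt"], "output": ["c.txt"]},
    30: {"input": ["raw.txt"], "output": ["a.txt"]},
    40: {"input": ["raw.txt"], "output": ["b.txt"]},
}


def execution_batches(steps):
    """Group steps into batches; steps within a batch are independent."""
    producer = {f: idx for idx, spec in steps.items() for f in spec["output"]}
    dag = {
        idx: {producer[f] for f in spec["input"] if f in producer}
        for idx, spec in steps.items()
    }
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned by get_ready() has all dependencies satisfied.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(STEPS)
print(batches)  # [[30, 40], [20], [10]]
```

Each inner list is a group of steps with no dependencies on one another, so 30 and 40 can run at the same time even though 10 and 20 have smaller indexes.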

In [3]:

Use of named_output for more complex cases

When your workflow becomes more complex, you can define data by their names as follows:

In [4]:
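As a conceptual illustration in plain Python (not SoS syntax, and not the behaviour of SoS's named_output in detail), referring to data by name can be thought of as looking up the step that labels one of its outputs with that name. The registry layout, the helper resolve_named_output, the names bam and vcf, and the file names are all hypothetical.

```python
# Hypothetical registry: each step labels its outputs with a name.
STEPS = {
    "align": {"output": {"bam": ["data.bam"]}},
    "call": {"output": {"vcf": ["data.vcf"]}},
}


def resolve_named_output(name, steps):
    """Return (step, files) for the step that labels an output `name`."""
    for step, spec in steps.items():
        if name in spec["output"]:
            return step, spec["output"][name]
    raise KeyError(f"no step produces an output named {name!r}")


step, files = resolve_named_output("bam", STEPS)
print(step, files)  # align ['data.bam']
```

The point of the sketch is that a consuming step only needs to know the name of the data it requires, not the file path or the step that produces it.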