Process-oriented workflows

  • Difficulty level: easy
  • Time needed to learn: 20 minutes or less
  • Key points:
    • A process-oriented workflow specifies a workflow as a series of steps to execute


Process-oriented workflows execute steps. For example, the first example in our tutorial on SoS workflow defines a workflow plot with two steps, plot_10 and plot_20. The magic %run plot or the command sos run script plot executes all steps in the workflow, regardless of whether these steps produce any output.

In [1]:
xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1 

Default input of numerically-indexed steps

The previous example simply lists the scripts for each step and does not specify the input and output of the steps. SoS assumes that a step without an input statement depends on all of its previous steps. That is to say, plot_20 will be executed after plot_10, and plot_30, if it exists, will be executed after both plot_10 and plot_20. The entire workflow will be executed sequentially.
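
The default ordering rule can be sketched in plain Python. This is only an illustration of the rule described above, not SoS's implementation; the function name `default_dependencies` is hypothetical, and `plot_30` is the hypothetical third step mentioned in the text.

```python
import re


def default_dependencies(step_names):
    """Map each step to the steps it implicitly depends on.

    Assumes no step has an input statement, so each numerically-indexed
    step depends on every step with a smaller index.
    """
    def index(name):
        # "plot_20" -> 20
        return int(re.search(r"_(\d+)$", name).group(1))

    ordered = sorted(step_names, key=index)
    return {name: ordered[:i] for i, name in enumerate(ordered)}


deps = default_dependencies(["plot_20", "plot_10", "plot_30"])
for step, previous in deps.items():
    print(step, "runs after", previous or "nothing")
```

Because every step waits for all earlier ones, no two steps can overlap, which is why such a workflow runs strictly in numerical order.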

Adding input and output statements to the steps has several benefits:

  • You can use the variables _input and _output in scripts, which is arguably more readable.
  • SoS can track the input and output of steps and create signatures. Steps are skipped if they have been executed before. See runtime signature for details.
  • SoS can determine step dependencies and create DAGs so that it can execute steps in parallel (see next section).
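
The idea of skipping a step that has already been executed can be illustrated with a toy signature check. This sketch is an assumption-laden illustration and not SoS's actual signature format or API: the helpers `signature`, `run_step`, and `copy_upper`, the file names, and the choice of hashing the script text together with the input content are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def signature(script, input_path):
    """Hash the script text together with the input file's content."""
    digest = hashlib.sha256()
    digest.update(script.encode())
    digest.update(Path(input_path).read_bytes())
    return digest.hexdigest()


def run_step(script, input_path, output_path, records, action):
    """Run `action` unless an identical run is already recorded.

    `records` maps an output path to the signature of the run that made it.
    Returns True if the step ran and False if it was skipped.
    """
    sig = signature(script, input_path)
    if Path(output_path).exists() and records.get(str(output_path)) == sig:
        return False
    action(input_path, output_path)
    records[str(output_path)] = sig
    return True


def copy_upper(src, dst):
    # Stand-in for the real work of a step.
    Path(dst).write_text(Path(src).read_text().upper())


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "in.txt"), Path(tmp, "out.txt")
    src.write_text("hello")
    records = {}
    first = run_step("upper", src, dst, records, copy_upper)
    second = run_step("upper", src, dst, records, copy_upper)  # unchanged: skipped
    src.write_text("changed")
    third = run_step("upper", src, dst, records, copy_upper)  # input changed: rerun
    print(first, second, third)
```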

The following workflow is a version of the previous workflow with input and output statements. Note, however, that plot_20 does not define an input, because a numerically-indexed step by default takes the step_output of its previous step (plot_10 in this case) as its step_input.
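
The default-input rule can be sketched as follows. This is a plain-Python illustration, not SoS code; the function `resolve_inputs` and the output name `DEG.pdf` are hypothetical, while `data/DEG.xlsx` and `DEG.csv` are taken from the command shown in the cell output.

```python
def resolve_inputs(steps):
    """Fill in missing inputs for numerically ordered steps.

    `steps` is a list of (name, input, output) tuples sorted by index;
    an input of None means the step has no input statement.
    """
    resolved = []
    previous_output = None
    for name, step_input, step_output in steps:
        if step_input is None:
            # Default: take the output of the previous step.
            step_input = previous_output
        resolved.append((name, step_input, step_output))
        previous_output = step_output
    return resolved


workflow = [
    ("plot_10", "data/DEG.xlsx", "DEG.csv"),
    ("plot_20", None, "DEG.pdf"),  # hypothetical output name
]
resolved = resolve_inputs(workflow)
for name, step_input, step_output in resolved:
    print(f"{name}: {step_input} -> {step_output}")
```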

In [2]:
xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1 

DAG of process-oriented workflow

Conceptually speaking, process-oriented workflows are executed sequentially. When you design a workflow, you focus on the initial input files and how they are processed step by step. However, a complex workflow can have branches, and these branches can be executed in parallel if you specify the input and output of steps.

For example, your workflow can have multiple starting points with different input files:

In [3]:
Summarizing results
> dag1.dot (664 B):
[Image: DAG rendered from dag1.dot]

In this workflow, steps 10 and 20 are executed in parallel because they have different input files and do not depend on each other.
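
How a dependency graph follows from input and output files can be sketched with the standard-library `graphlib` module (Python 3.9 or later). This is an illustration under stated assumptions, not SoS's scheduler: the helpers `build_dag` and `parallel_batches`, the file names, and the shape of the final summarizing step are hypothetical.

```python
from graphlib import TopologicalSorter


def build_dag(steps):
    """Return {step: set of steps it depends on} from file names.

    `steps` maps a step name to an (inputs, outputs) pair of lists.
    A step depends on another step if it reads a file that step writes.
    """
    producer = {f: name for name, (_, outputs) in steps.items() for f in outputs}
    return {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in steps.items()
    }


def parallel_batches(dag):
    """Group steps into batches whose members can run at the same time."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical file names: two starting points feeding a summary step.
steps = {
    "10": (["a.txt"], ["a.bak"]),
    "20": (["b.txt"], ["b.bak"]),
    "30": (["a.bak", "b.bak"], ["summary.txt"]),
}
batches = parallel_batches(build_dag(steps))
print(batches)
```

Steps 10 and 20 land in the same batch because neither reads a file the other writes, while the summary step has to wait for both.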

As a slightly more complex example, the following workflow has two longer branches, with step 20 executed after step 10, and step 40 after step 30. More interestingly, because step 10 takes longer to execute, step 40 actually starts before step 20. That is to say, although the workflow is conceptually sequential, in reality the steps can be executed out of their numerical order.

In [4]:
Generating a.bak at step 10
Generating b.res at step 30
Generating b.res1 at step 40
Generating a.bak1 at step 20
Summarizing results
> dag2.dot (1.5 KiB):
[Image: DAG rendered from dag2.dot]
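
The out-of-order start described above can be reproduced with a small scheduling calculation. This is a simplified model, not a description of how SoS schedules work: it assumes unlimited parallel workers, the function `start_times` is hypothetical, the durations are made-up numbers chosen so that step 10 is the slow one, and the final summarizing step is left out.

```python
def start_times(durations, depends_on):
    """Earliest start time of each step with unlimited parallel workers.

    A step starts as soon as every step it depends on has finished.
    """
    start, finish = {}, {}

    def visit(step):
        if step in start:
            return
        deps = depends_on.get(step, [])
        for dep in deps:
            visit(dep)
        start[step] = max((finish[dep] for dep in deps), default=0)
        finish[step] = start[step] + durations[step]

    for step in durations:
        visit(step)
    return start


# Two branches: 10 -> 20 and 30 -> 40.
durations = {"10": 3, "20": 1, "30": 1, "40": 1}
depends_on = {"20": ["10"], "40": ["30"]}

start = start_times(durations, depends_on)
order = sorted(start, key=lambda step: (start[step], step))
print(order)
```

With these numbers, step 40 starts as soon as step 30 finishes, while step 20 is still waiting for the slower step 10.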