Basic SoS workflows

Difficulty level: easy
Time needed to learn: 10 minutes or less
Key points:
- A forward-style workflow consists of numerically numbered steps
- Multiple workflows can be defined in a single SoS script or notebook
- Optional input and output statements can be added to change how workflows are executed
- The default input of a step is the output of its previous step

Simple workflows with numerically numbered steps

The workflows you have seen so far have numerically numbered steps. For example, the following example from the first tutorial has a single workflow plot with steps plot_10 and plot_20, and SoS executes the steps in numeric order.

Workflows with numerically numbered steps

- Steps are named name_index (e.g. step_10), where name is the name of the workflow and index is a number.
- Step indexes are usually not consecutive, which makes it easy to insert new steps later.
- Both the workflow name and the index can be omitted: 10, 20, etc. are treated as steps of an unnamed workflow, and step is treated as a one-step workflow.
- A workflow can be executed by name (e.g. %run name in Jupyter, sos run name from the command line). If no name is given, a default workflow is executed when only one workflow is defined or when a default workflow is defined.
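
The naming rules above can be sketched in plain Python. This is only an illustrative model, not SoS's own parser: the function names `split_step_name` and `group_workflows` are hypothetical, and using an empty string for the unnamed workflow is an assumption made for this sketch.

```python
from collections import defaultdict


def split_step_name(header):
    """Split a step header such as 'plot_10' into (workflow_name, index).

    Assumption for this sketch: a bare number belongs to an unnamed
    workflow (represented here by ''), and a bare name has no index.
    """
    if header.isdigit():
        return ("", int(header))
    name, sep, index = header.rpartition("_")
    if sep and index.isdigit():
        return (name, int(index))
    return (header, None)


def group_workflows(headers):
    """Group step headers by workflow and order each group by numeric index."""
    workflows = defaultdict(list)
    for header in headers:
        name, index = split_step_name(header)
        # Steps without an index sort as 0; they are one-step workflows anyway.
        workflows[name].append((index if index is not None else 0, header))
    return {name: [h for _, h in sorted(steps)] for name, steps in workflows.items()}


workflows = group_workflows(["plot_20", "plot_10", "convert"])
print(workflows)
```

Sorting on the integer index rather than on the header text is what keeps a step such as plot_100 after plot_20.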

xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1

The workflow is executed by default with magic %run because only one workflow is defined in the script. You can also define multiple workflows and execute them by their names. For example, the following script defines two single-step workflows convert and plot. Because there is no default workflow, you will have to refer to them with their names:

%run convert

and

%run plot

xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1

Default input of steps

As shown in How to specify input and output files and process input files in groups, you can define input and output for each step.

Default `input` of numerically numbered workflows

The default input of a step in a numerically numbered workflow is the output of its previous step

Therefore, in the following workflow, the input statement of plot_20 can be omitted.
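
The default-input rule can be pictured with a small Python model. It is a sketch of the rule as stated here, not of SoS internals: the function `resolve_inputs`, the dictionary layout, and the file name output.pdf are assumptions made for illustration.

```python
def resolve_inputs(steps):
    """Return (step name, effective input) pairs for steps in numeric order.

    A step without an explicit 'input' inherits the 'output' of the step
    that precedes it, which mirrors the default-input rule described above.
    """
    resolved = []
    previous_output = []
    for step in steps:
        step_input = step.get("input")
        if step_input is None:
            step_input = previous_output
        resolved.append((step["name"], list(step_input)))
        previous_output = step.get("output", [])
    return resolved


steps = [
    {"name": "plot_10", "input": ["data/DEG.xlsx"], "output": ["DEG.csv"]},
    # No 'input' here: plot_20 falls back to the output of plot_10.
    {"name": "plot_20", "output": ["output.pdf"]},  # output.pdf is a hypothetical name
]
resolved = resolve_inputs(steps)
print(resolved)
```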

xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1

Basic data-flow based workflows

We have shown the same workflows in the plot_10, plot_20 style, in the convert and plot style, and with and without specification of input and output. What will happen if you define a workflow in separate steps with input and output statements?

Let us first remove the intermediate DEG.csv,

and execute the plot step of the following workflow

Simple data-flow based workflow

If the input files of a step do not exist, SoS will automatically check other steps in the workflow and call them to generate the needed files. This allows the creation of workflows based on data flow.

xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1

As you can see, although only the plot step is requested, SoS executes both the convert and plot steps because the required input file csv_file (DEG.csv) does not exist. In this case, SoS looks for a step that produces DEG.csv and executes it to generate the file before plot is executed.
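
The lookup described above can be modeled as a short dependency walk. The following Python sketch only illustrates the idea and is not how SoS is implemented: the function `plan`, the step dictionary, and the file name output.pdf are hypothetical, and circular dependencies are not handled.

```python
def plan(target_step, steps, existing_files):
    """Return the step names that must run, in order, to execute target_step.

    steps maps a step name to its declared 'input' and 'output' files.
    A step's missing input triggers the step that declares it as output.
    """
    producers = {out: name for name, spec in steps.items() for out in spec["output"]}
    order = []

    def visit(name):
        if name in order:
            return
        for needed in steps[name]["input"]:
            if needed in existing_files:
                continue  # file is already there, nothing to generate
            if needed not in producers:
                raise FileNotFoundError(needed)
            visit(producers[needed])  # run the producing step first
        order.append(name)

    visit(target_step)
    return order


steps = {
    "convert": {"input": ["data/DEG.xlsx"], "output": ["DEG.csv"]},
    "plot": {"input": ["DEG.csv"], "output": ["output.pdf"]},  # hypothetical output name
}
print(plan("plot", steps, existing_files={"data/DEG.xlsx"}))
```

In this model, once DEG.csv exists the same request needs only the plot step.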

Output of data-flow based workflow

For output files to be automatically identified by SoS as input for another step, the output statement must be clearly defined. That is to say, they must be either

- One or more filenames (e.g. output: "DEG.csv"), or
- An expression that can be easily evaluated from variables defined in the global section (e.g. output: csv_file)

So output derived from `_input` cannot be used (e.g. output: _input.with_suffix('.bak')). However,

- You can assign a name to complex output and use named_output() to refer to it.
- You can create makefile-style steps, which allow files to be created through pattern matching.
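
The following Python sketch shows why output derived from `_input` cannot be matched in advance. It is a model under stated assumptions, not SoS code: outputs known before execution are represented as plain lists, outputs that depend on `_input` are represented as functions, and the names `static_outputs` and `backup` are hypothetical.

```python
from pathlib import Path


def static_outputs(steps):
    """Index only the outputs that are known before any step runs."""
    index = {}
    for name, output in steps.items():
        if callable(output):
            # The result depends on _input, so it is unknown until the step runs.
            continue
        for filename in output:
            index[filename] = name
    return index


steps = {
    "convert": ["DEG.csv"],
    "backup": lambda _input: [Path(_input).with_suffix(".bak")],
}
index = static_outputs(steps)
print(index)
```

Only convert appears in the index, so a request for a .bak file could not be traced back to the backup step in this model.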

Passing output of steps with substeps *

Output of a step with substeps

If a step has multiple substeps, the step output consists of the _output of each substep. By default this output is passed to the next step, where it creates multiple substeps.

Things can get a little bit complicated when a step has multiple substeps. As you can recall from How to specify input and output files and process input files in groups, multiple substeps can be defined by input option group_by, each with its own _input and _output. When the output of such a step is inherited by another step, these _output will become the _input of the substeps.
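
The hand-over of grouped output can be sketched in Python as follows. This is a simplified model rather than SoS behavior in full: the helper `run_step`, the two step functions, and the fastq file names are assumptions made for illustration.

```python
def run_step(input_groups, make_output):
    """Run one substep per input group and collect each substep's _output."""
    return [make_output(group) for group in input_groups]


def fastqc_report(group):
    # Stand-in for running fastqc: one HTML report name per fastq file.
    return [name.replace(".fastq", "_fastqc.html") for name in group]


def check_report(group):
    # Stand-in for parsing the HTML report of one substep.
    return [f"checked {name}" for name in group]


# Hypothetical input files, one group per file.
step_10_groups = [["data/S20_R1.fastq"], ["data/S20_R2.fastq"]]
step_10_output = run_step(step_10_groups, fastqc_report)

# The next step has no explicit input, so each _output group of the
# previous step becomes the _input of one of its substeps.
step_20_output = run_step(step_10_output, check_report)
print(step_20_output)
```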

For example, after running fastqc on the input fastq files, we would like to process the generated HTML file and check if the qualities are ok. We use the beautifulsoup Python module and find all the <h2> headers. Without going into the details of the use of beautifulsoup to parse HTML files, you should notice that

- No input is defined for step 20, so it takes the output of step 10 as its input.
- The output of step 10 contains two groups, data/S20_R1_fastqc.html and data/S20_R2_fastqc.html, which become the input of two substeps of step 20.
- The input groups of step 20 are processed one by one.

S20_R1_fastqc Basic Statistics: [OK]
S20_R1_fastqc Per base sequence quality: [OK]
S20_R1_fastqc Per tile sequence quality: [OK]
S20_R1_fastqc Per sequence quality scores: [OK]
S20_R1_fastqc Per base sequence content: [FAIL]
S20_R1_fastqc Per sequence GC content: [FAIL]
S20_R1_fastqc Per base N content: [OK]
S20_R1_fastqc Sequence Length Distribution: [WARN]
S20_R1_fastqc Sequence Duplication Levels: [OK]
S20_R1_fastqc Overrepresented sequences: [FAIL]
S20_R1_fastqc Adapter Content: [OK]
S20_R2_fastqc Basic Statistics: [OK]
S20_R2_fastqc Per base sequence quality: [OK]
S20_R2_fastqc Per tile sequence quality: [OK]
S20_R2_fastqc Per sequence quality scores: [OK]
S20_R2_fastqc Per base sequence content: [FAIL]
S20_R2_fastqc Per sequence GC content: [FAIL]
S20_R2_fastqc Per base N content: [OK]
S20_R2_fastqc Sequence Length Distribution: [WARN]
S20_R2_fastqc Sequence Duplication Levels: [OK]
S20_R2_fastqc Overrepresented sequences: [FAIL]
S20_R2_fastqc Adapter Content: [OK]

If you would like to re-group the default input, you can redefine the input explicitly, or apply option group_by to the default input:

[##] 2 steps processed (1 job completed, 2 jobs ignored)
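
Regrouping inherited input can be pictured as flattening the groups and splitting them again. The Python sketch below is a model of that idea only: the function `regroup` is hypothetical, and it covers just two forms of grouping (everything in one group, or groups of a fixed size), not every value that group_by accepts.

```python
def regroup(groups, group_by):
    """Flatten the inherited groups and split them again.

    group_by == "all" puts every file into a single group; an integer n
    creates groups of n files each.
    """
    files = [name for group in groups for name in group]
    if group_by == "all":
        return [files]
    return [files[i:i + group_by] for i in range(0, len(files), group_by)]


inherited = [["data/S20_R1_fastqc.html"], ["data/S20_R2_fastqc.html"]]
merged = regroup(inherited, "all")
print(merged)
```

With a single merged group, the receiving step runs one substep instead of two.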