Named output

Difficulty level: easy
Time needed to learn: 10 minutes or less
Key points:
- Output can be grouped by names, which can be referred to by [name]
- Function named_output(name) refers to the output labeled name in any step
- The return value of named_output(name) can also have groups

Limitations of basic dataflow-based workflows

In our tutorial on How to define and execute basic SoS workflows we introduced basic dataflow-based workflows as follows:

xlsx2csv data/DEG.xlsx > DEG.csv

Basically, when the input of step plot (csv_file) is unavailable, SoS looks in the script for another step that generates this output. If it can be found, it will execute that step to produce the required input before step plot is executed.
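The script behind the command shown above is not reproduced here. A minimal sketch consistent with that command and with the step names used in the text might look like the following. The variable names excel_file and figure_file, the file name output.pdf, the column names, and the R plotting commands are assumptions made for illustration.

```sos
[global]
# csv_file is defined globally so SoS can match it against step outputs
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'  # hypothetical file name

[convert]
input: excel_file
output: csv_file
run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
# csv_file does not exist yet, so SoS looks for a step whose output matches it
input: csv_file
output: figure_file
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    # column names are assumed for illustration
    plot(data$log2FoldChange, data$stat)
    dev.off()
```

Running only step plot in such a script would trigger step convert first, because the output of convert matches the missing input of plot.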

A limitation of this kind of workflow is that the output of the other step has to be determined "easily", either from the output statement itself or from variable definitions in the global section. The following workflow would fail because the output of the step is defined as

output: _input.with_suffix('csv')

which takes the _input of the step and replaces its suffix with .csv. Because the _output depends on _input, SoS cannot tell in advance that this step produces data/DEG.csv, so the step cannot be used to generate that file directly.

ERROR: No rule to generate target 'data/DEG.csv', needed by 'plot'.

RuntimeError: Workflow exited with code 1

Named output

Similar to the input statement, the output of SoS steps can also be named. In the following example

- 4 substeps are defined with i=0, 1, 2, and 3.
- The output of each substep is f'a_{i}.txt' and f'b_{i}.txt' (a_0.txt, b_0.txt, etc.).
- The outputs are grouped into groups a and b.
- The output of the entire step consists of the _output of all substeps, which becomes the _input of the next step. This is how we can examine the output of step 10.

a_0.txt b_0.txt with labels ['a', 'b']
a_0.txt
a_1.txt b_1.txt with labels ['a', 'b']
a_1.txt
a_2.txt b_2.txt with labels ['a', 'b']
a_2.txt
a_3.txt b_3.txt with labels ['a', 'b']
a_3.txt

As we can see, there are four substeps for step 20. The _input of substeps has two files with names a and b, and we can refer to the targets with name a with _input['a'].
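The code cell that produced this output is not shown. A minimal sketch that matches the description and the printed lines, assuming the files are created as empty placeholders with _output.touch(), could be:

```sos
[10]
# one substep per value of i
input: for_each=dict(i=range(4))
# each substep produces two files, labeled a and b
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[20]
# _input of each substep is the _output of the matching substep of step 10
print(f'{_input} with labels {_input.labels}')
print(_input['a'])
```

Step 20 has no input statement, so it inherits the grouped, labeled output of step 10 as its input.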

Function `named_output`

Function `named_output(name, group_by, ...)`

Function named_output refers to the named output of any SoS step defined in the script. Using named_output in the input statement of a step creates a dependency on the step with the named output, and inserts the named output as input of the step.

The problem we had with complex output can be resolved by function named_output(). For example, the aforementioned workflow can be written as

xlsx2csv data/DEG.xlsx > data/DEG.csv

null device 
          1

Here named_output('csv') refers to any step that produces an output with name csv, which is the step convert in this workflow. The input of step plot is the return value of named_output('csv') which is data/DEG.csv, although its exact name can only be identified after the conversion step is executed.
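The rewritten workflow is not reproduced here. A sketch consistent with the description, assuming the suffix is written with a leading dot and that the plot step writes a hypothetical output.pdf using illustrative R commands, might be:

```sos
[convert]
input: 'data/DEG.xlsx'
# the output is labeled csv; its path is derived from _input at run time
output: csv=_input.with_suffix('.csv')
run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
# depends on whichever step defines an output named csv
input: named_output('csv')
output: 'output.pdf'
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    # column names are assumed for illustration
    plot(data$log2FoldChange, data$stat)
    dev.off()
```

The plot step never mentions data/DEG.csv by name. It only asks for the output labeled csv, so the dependency holds even though the file name is computed inside step convert.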

Uniqueness of names of output

Although outputs of steps can be identified with arbitrary names and multiple steps can use the same names for their outputs, names referred to by function named_output have to be unique.

`named_output()` can only be called from input statements

named_output() is a function provided by SoS to define input of steps and can only be called from input statements.

Groups of output returned by `named_output` *

As we have seen, the output of a step can have multiple groups. In this case the return value of named_output(name) consists of the name part of all groups.

In the following example, named_output('a') obtains the a part of the output of step A, which consists of 4 groups. During the execution of the workflow, step A is executed to generate input for step default, which consists of 4 substeps with _input equal to a_0.txt, a_1.txt, etc.

Generating a_0.bak
Generating a_1.bak
Generating a_2.bak
Generating a_3.bak
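The example that printed these lines is not shown. A minimal sketch that would print them, assuming step A is the same as step 10 in the earlier example and that the .bak files are created with _output.touch(), could be:

```sos
[A]
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
# receives 4 groups, each holding only the file labeled a
input: named_output('a')
output: _input.with_suffix('.bak')
print(f'Generating {_output}')
_output.touch()
```

Because the grouping of step A is preserved, step default runs once per group and prints one line per substep.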

Option `group_by` of function `named_output`

Option group_by regroups the groups returned by named_output

If you would like to remove the groups or re-group the returned files using another method, you can use the group_by option of function named_output. For example, the group_by='all' option in the following example groups all 4 input files into a single group:

Generating a_0.bak a_1.bak a_2.bak a_3.bak
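A sketch of a workflow that would print this single line, assuming named_output is called with group_by='all' and that the output names are built with a list comprehension over the input files, might be:

```sos
[A]
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
# group_by='all' merges the 4 groups into one, so there is a single substep
input: named_output('a', group_by='all')
output: [x.with_suffix('.bak') for x in _input]
print(f'Generating {_output}')
_output.touch()
```

With a single group, _input holds all four a_*.txt files at once and the print statement is executed only once.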

`named_output()` with skipped substeps

Function named_output obtains the outputs of another step, or more precisely, the outputs of its substeps. There is, however, a case when a substep is skipped and leaves no output. In this case, the output of that substep is discarded.

For example, when a substep in the step A of the following workflow is skipped, the result from named_output('A') contains only the output of valid substeps.

However, if you would like to keep a consistent number of substeps across steps, you can get output from all substeps, including the skipped ones, by using option remove_empty_groups=False.

output_0.txt
output_1.txt
output_3.txt
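The three lines above list the outputs of the substeps that were not skipped. A sketch of a workflow that would behave this way, assuming the substep with i equal to 2 is skipped with the skip_if action and that the output of step A is labeled A, could be:

```sos
[A]
input: for_each=dict(i=range(4))
output: A=f'output_{i}.txt'
# skip the substep with i == 2, which then produces no output
skip_if(i == 2, 'Substep 2 is skipped')
_output.touch()

[default]
# only the 3 non-empty groups are passed on by default
input: named_output('A')
print(_input)
```

Changing the input statement to named_output('A', remove_empty_groups=False) would keep the empty group of the skipped substep, so step default would again have four substeps.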
