- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Output can be grouped by names, which can be referred to by `[name]`
  - Function `named_output(name)` refers to output with `name` in any step
  - Return value of `named_output(name)` can also have groups
In our tutorial on How to define and execute basic SoS workflows we introduced basic dataflow-based workflows as follows:
Basically, when the input of step `plot` (`csv_file`) is unavailable, SoS looks in the script for another step that generates this output. If it can be found, it will execute that step to produce the required input before step `plot` is executed.
A limitation of this kind of workflow is that the output of another step has to be determined "easily", either from the `output` statement itself or with variable definitions from the `global` section. The following workflow would fail because the output of the step is defined as `output: _input.with_suffix('csv')`, which takes the `_input` of the step and replaces its suffix with `.csv`. Because the `_output` depends on `_input`, SoS cannot tell in advance that this step generates `data/DEG.csv`, so the step cannot be used to produce that file directly.
Similar to the input statement, the output of SoS steps can also be named. In the following example

- 4 substeps are defined with `i=0`, `1`, `2`, and `3`.
- The output of each substep is `f'a_{i}.txt'` and `f'b_{i}.txt'` (`a_0.txt`, `b_0.txt`, etc.).
- The outputs are grouped into groups `a` and `b`.
- The output of the entire step consists of the `_output` of its substeps, which becomes the `_input` of the next step. This is how we can examine the output of step `10`.
As we can see, there are four substeps for step `20`. The `_input` of each substep has two files with names `a` and `b`, and we can refer to the targets with name `a` with `_input['a']`.
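The grouping described above can be pictured with a small plain-Python model. This is only a sketch and not SoS code: each substep's `_output` is modelled as a dictionary from an output name to a list of file names, and the helper `substep_outputs` is a hypothetical name introduced here for illustration.

```python
# Plain-Python model (not SoS): one dict per substep, keyed by output name.
def substep_outputs(n):
    """Return the modelled _output of n substeps with named parts 'a' and 'b'."""
    return [{"a": [f"a_{i}.txt"], "b": [f"b_{i}.txt"]} for i in range(n)]

step_10_output = substep_outputs(4)

# Each group of the previous step becomes the _input of one substep
# of the next step.
for _input in step_10_output:
    all_files = _input["a"] + _input["b"]  # both files of the group
    only_a = _input["a"]                   # analogous to _input['a'] in SoS
    print(all_files, only_a)
```

In this model the number of substeps of the next step equals the number of groups, and the name `a` selects one file out of each group.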
Function named_output(name, group_by, ...)
Function `named_output` refers to the named output of any SoS step defined in the script. Using `named_output` in the `input` statement of a step creates a dependency on the step with the named output, and inserts the named output as input of the step.
The problem we had with complex output can be resolved by function `named_output()`. For example, the aforementioned workflow can be written as
Here `named_output('csv')` refers to any step that produces an output with name `csv`, which is the step `convert` in this workflow. The input of step `plot` is the return value of `named_output('csv')`, which is `data/DEG.csv`, although its exact name can only be identified after the conversion step is executed.
Uniqueness of names of output
Although outputs of steps can be identified with arbitrary names and multiple steps can use the same names for their outputs, names referred to by function `named_output` have to be unique.
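The lookup by name, and the reason the name has to be unique, can be illustrated with a plain-Python model. This is a sketch under stated assumptions and not how SoS is implemented: the `steps` dictionary and the helper `find_provider` are hypothetical, and `data/DEG.xlsx` is an assumed input file name used only for illustration.

```python
from pathlib import PurePosixPath

# Plain-Python model (not SoS). Each modelled step declares the names of its
# outputs up front; the actual file names are computed only when it runs.
steps = {
    "convert": {
        "output_names": ["csv"],
        "run": lambda path: {"csv": str(PurePosixPath(path).with_suffix(".csv"))},
    },
    "plot": {"output_names": [], "run": lambda path: {}},
}

def find_provider(name):
    """Return the single step that declares an output called `name`."""
    providers = [step for step, spec in steps.items() if name in spec["output_names"]]
    if len(providers) != 1:
        raise ValueError(f"output name {name!r} must match exactly one step, found {providers}")
    return providers[0]

provider = find_provider("csv")  # matched by name alone, before any file exists
# 'data/DEG.xlsx' is a hypothetical input used only for this illustration.
resolved = steps[provider]["run"]("data/DEG.xlsx")["csv"]
print(provider, resolved)
```

The provider is found from the declared name, while the concrete file name becomes available only after the providing step has run. If two steps declared the same name, the lookup would be ambiguous, which is why the model raises an error in that case.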
`named_output()` can only be called from input statements
`named_output()` is a function provided by SoS to define input of steps and can only be called from input statements.
As we have seen, the output of a step can have multiple groups. In this case the return value of `named_output(name)` consists of the `name` part of all groups.
In the following example, `named_output('a')` obtains the `a` part of the output of step `A`, which consists of 4 groups. During the execution of the workflow, step `A` is executed to generate input for step `default`, which consists of 4 substeps with `_input` equal to `a_0.txt`, `a_1.txt`, etc.
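As a rough picture of how the named part is taken from every group, here is a plain-Python model. It is not SoS code; `step_a_groups` and `select_named` are hypothetical names that only mimic the behaviour described above.

```python
# Plain-Python model (not SoS): the output of step A as one dict per substep.
step_a_groups = [{"a": [f"a_{i}.txt"], "b": [f"b_{i}.txt"]} for i in range(4)]

def select_named(groups, name):
    """Keep only the part called `name` in every group, preserving the grouping."""
    return [group[name] for group in groups]

inputs = select_named(step_a_groups, "a")

# The consuming step runs one substep per group.
for _input in inputs:
    print(_input)
```

The grouping is preserved: four groups go in and four single-file groups come out, one per substep of the consuming step.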
Option `group_by` of function `named_output`
Option `group_by` regroups the groups returned by `named_output`
If you would like to remove the groups or regroup the returned files using another method, you can use the `group_by` option of function `named_output`. For example, the `group_by='all'` option in the following example groups all 4 input files into a single group:
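The effect of regrouping can be sketched in plain Python. This model is an illustration only: `regroup` is a hypothetical helper, and it covers just the two cases discussed here (keeping the groups, and `'all'`), not the other grouping methods SoS may offer.

```python
def regroup(groups, group_by=None):
    """Model of regrouping: None keeps the groups, 'all' merges them into one."""
    if group_by is None:
        return [list(group) for group in groups]
    if group_by == "all":
        return [[name for group in groups for name in group]]
    raise ValueError(f"group_by={group_by!r} is not covered by this sketch")

# Four single-file groups, as returned for the `a` part of step A.
groups = [[f"a_{i}.txt"] for i in range(4)]
merged = regroup(groups, group_by="all")
print(merged)
```

With a single merged group, the consuming step would run once with all four files as its input instead of four times with one file each.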
Function `named_output` obtains the outputs, more precisely the substep outputs, of another step. There is, however, a case when a substep is skipped and leaves no output. In this case, the substep output is discarded.
For example, when a substep in the step `A` of the following workflow is skipped, the result from `named_output('A')` contains only the output of valid substeps.
However, if you would like to keep a consistent number of substeps across steps, you can get the output from all substeps by using option `remove_empty_groups=False`.
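The difference between the two settings can be pictured with a plain-Python model. This is a sketch, not SoS code: `collect_groups` is a hypothetical helper, skipped substeps are represented as empty lists, and the run in which the substep with index 2 is skipped is an assumed scenario.

```python
def collect_groups(substep_outputs, remove_empty_groups=True):
    """Model: drop the empty groups left by skipped substeps unless told otherwise."""
    if remove_empty_groups:
        return [group for group in substep_outputs if group]
    return [list(group) for group in substep_outputs]

# Hypothetical run in which the substep with index 2 was skipped.
outputs = [["a_0.txt"], ["a_1.txt"], [], ["a_3.txt"]]

kept = collect_groups(outputs)                                  # skipped substep dropped
all_groups = collect_groups(outputs, remove_empty_groups=False)  # empty group retained
print(len(kept), len(all_groups))
```

Keeping the empty group preserves the one-to-one correspondence between substeps of the two steps, at the cost of a substep that receives no input.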