- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Output can be grouped by names, which can be referred to by [name]
  - Function named_output(name) refers to the output named name in any step
  - Return value of named_output(name) can also have groups
 
In our tutorial on How to define and execute basic SoS workflows we introduced basic dataflow-based workflows as follows:
Basically, when the input of step plot (csv_file) is unavailable, SoS looks in the script for another step that generates this output. If it can be found, it will execute that step to produce the required input before step plot is executed.
A limitation of this kind of workflow is that the output of the other step has to be determined "easily", either from the output statement itself or with variable definitions from the global section. The following workflow would fail because the output of the step is defined as
output: _input.with_suffix('.csv')
which takes the _input of the step and replaces its suffix with .csv. Because _output depends on _input, SoS cannot determine from the output statement alone that this step generates data/DEG.csv.
Similar to the input statement, the output of SoS steps can also be named. In the following example
- 4 substeps are defined with i=0, 1, 2, and 3
- The output of each substep is f'a_{i}.txt' and f'b_{i}.txt' (a_0.txt, b_0.txt etc.).
- The outputs are grouped under the names a and b.
- The output of the entire step consists of the _output of all substeps, which becomes the _input of the next step. This is how we can examine the output of step 10.
As we can see, there are four substeps for step 20. The _input of each substep has two files with names a and b, and we can refer to the targets with name a with _input['a'].
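The workflow described above can be sketched as follows. This is a minimal reconstruction from the description, not necessarily the original example: the step bodies only create empty files and print the input, and the script name used below (named.sos) is hypothetical.

```sos
[10]
# Four substeps, one for each value of i
input: for_each=dict(i=range(4))
# Each substep produces two files, one named a and one named b
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
# Create the (empty) output files
_output.touch()

[20]
# Without an input statement, step 20 receives the output of step 10,
# one group (two files) per substep
print(f'all input: {_input}')
# Select only the targets named a
print(f"a only: {_input['a']}")
```

Running it with sos run named.sos should execute step 10 and then step 20 once for each of the four groups.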
Function named_output(name, group_by, ...)
Function named_output refers to the named output of any SoS step defined in the script. Using named_output in the input statement of a step creates a dependency on the step with the named output, and inserts the named output as input of the step.
The problem we had with complex output can be resolved by function named_output(). For example, the aforementioned workflow can be written as
Here named_output('csv') refers to any step that produces an output with name csv, which is the step convert in this workflow. The input of step plot is the return value of named_output('csv') which is data/DEG.csv, although its exact name can only be identified after the conversion step is executed.
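A minimal sketch of such a workflow is shown below. The step names convert and plot and the output name csv come from the text; the file names data/DEG.xlsx and output.pdf, the use of the xlsx2csv command line tool, and the placeholder body of step plot are assumptions made for illustration.

```sos
[global]
# Hypothetical file names, for illustration only
excel_file = 'data/DEG.xlsx'
figure_file = 'output.pdf'

[convert]
input: excel_file
# The output is still derived from _input, but it now carries the name csv
output: csv=_input.with_suffix('.csv')
# Assumes the xlsx2csv tool is installed; any converter would do
run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
# Depend on whichever step produces an output named csv
input: named_output('csv')
output: figure_file
# Placeholder body; a real step would draw the figure here
print(f'Plotting {_input} to {_output}')
_output.touch()
```

Because step plot asks for the output by name rather than by file name, SoS does not need to know the path data/DEG.csv before step convert has run.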
Uniqueness of names of output
Although outputs of steps can be identified with arbitrary names and multiple steps can use the same names for their outputs, names referred to by function named_output have to be unique.
named_output() can only be called from input statements
named_output() is a function provided by SoS to define input of steps and can only be called from input statements.
As we have seen, the output of a step can have multiple groups. In this case the return value of named_output(name) consists of the name part of all groups.
In the following example, named_output('a') obtains the a part of the output of step A, which consists of 4 groups. During the execution of the workflow, step A is executed to generate input for step default, which then runs 4 substeps with _input equal to a_0.txt, a_1.txt etc.
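A sketch of this example, reconstructed from the description, could look like the following. The body of step default is a placeholder that only prints its input.

```sos
[A]
# Four substeps, each producing one file named a and one named b
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
# Take only the a part of each of the four output groups of step A
input: named_output('a')
# Runs once per group, with _input set to a_0.txt, a_1.txt, etc.
print(f'Processing {_input}')
```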
Option group_by of function named_output
Option group_by regroups the groups returned by named_output
If you would like to remove the groups or re-group the returned files using another method, you can use the group_by option of function named_output. For example, the group_by='all' option in the following example groups all 4 input files into a single group:
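As a sketch, assuming a step A that produces four groups with outputs named a and b, the group_by option of named_output (listed in the function signature above) would be used like this. The body of step default is a placeholder.

```sos
[A]
# Four substeps, each producing one file named a and one named b
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
# Merge the four groups returned by named_output into a single group
input: named_output('a', group_by='all')
# Runs once, with all four a files in _input
print(f'Processing {_input} in a single substep')
```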
Function named_output obtains the outputs, more precisely the substep outputs, of another step. There is, however, a case when a substep is skipped and leaves no output. In this case, the output of that substep is discarded.
For example, when a substep in step A of the following workflow is skipped, the result from named_output('A') contains only the output of valid substeps.
However, if you would like to keep a consistent number of substeps across steps, you can get the output from all substeps by using option remove_empty_groups=False.
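The following sketch illustrates the option under a few assumptions: the substep with i == 2 is skipped with the skip_if action, the output of step A is named A as in the text, and step default only prints what it receives.

```sos
[A]
input: for_each=dict(i=range(4))
output: A=f'a_{i}.txt'
# Skip the substep with i == 2, which then leaves no output
skip_if(i == 2)
_output.touch()

[default]
# Keep the empty group so that this step also has four substeps
input: named_output('A', remove_empty_groups=False)
# _index is the index of the current substep
print(f'Substep {_index}: {_input}')
```

With remove_empty_groups=False, step default is expected to run four substeps, one of them with an empty _input; without the option it would run three.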