The `output` statement

Difficulty level: easy
Time needed to learn: 10 minutes or less
Key points:
- Step output is defined for each substep and can be derived from substep input (variable _input)
- Variable step_output is defined at the completion of the step, and can be passed to other steps

The output statement defines the output files or targets of a SoS step. It is optional but fundamental for the creation of all but very simple workflows. You can check out the How to create dependencies between SoS steps tutorial for a quick overview of the use of output statements. This tutorial lists what you can put in the output statement of a step, with simple examples; refer to other tutorials for more in-depth discussions of these topics.

Steps with no output statement

The output statement is optional. When no output file is defined, a step will have undefined output.

For example, the following workflow has a step A that executes a simple shell script. No output statement is needed and the workflow will work just fine.

do something
The input of step A_2 is ""

In simple workflows with numerically indexed steps, an empty output will be passed to the next step.

Unnamed output files

The easiest way to explicitly specify the output of a step is to list output files directly in the output statement.

_output is a.txt

Here we use the touch function of _output, which is of type sos_targets. This function creates the file or files listed in variable _output. It is used quite often in the tutorials because SoS checks whether the output files exist after the step is executed.

As with the input statement, multiple files can be listed as multiple parameters, sequences (list, tuple, etc.), or variables of string or sequence types.

Output in substeps

The output statement can define output for a single substep or for all substeps. That is to say,

- If the output targets are ungrouped, it defines _output. step_output would be an accumulated version of _output.
- If the output targets are grouped with options group_by or for_each, it defines step_output, which should have the same number of groups as step_input.

Let us create a few input files,

The output statement usually defines the output of a single substep. In the following example, option group_by creates two substeps with _input being a.txt and b.txt respectively. The _input (actually _input[0]) is of type file_target, which is derived from pathlib.Path, so you can use any member function of pathlib.Path. Here we use with_suffix to obtain a.bak from a.txt.

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
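The with_suffix call used here is standard pathlib behavior and can be tried outside SoS. The following is a minimal plain-Python sketch, assuming pathlib.Path as a stand-in for file_target; derive_backup is a hypothetical helper name and not part of SoS.

```python
from pathlib import Path


def derive_backup(input_file: str) -> Path:
    """Return the input path with its last suffix replaced by .bak.

    Hypothetical helper for illustration; pathlib.Path stands in for
    the file_target type described in the text.
    """
    return Path(input_file).with_suffix(".bak")


for name in ["a.txt", "b.txt"]:
    # Each input file maps to an output file with the same stem.
    print(name, "->", derive_backup(name))
```

Note that with_suffix replaces only the last suffix, so a name such as a.tar.gz would become a.tar.bak.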

As you can see, _output is defined for each substep from _input. But what is step_output?

step_output is defined as an accumulated version of _output, with _output as its groups. It is useful only when the output is imported to other steps, either implicitly as shown below, or as output of functions output_from and named_output.

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
step_input is a.bak b.bak, substep input is a.bak
step_input is a.bak b.bak, substep input is b.bak
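One way to picture the relationship between _output and step_output is a list of groups that also flattens to a single list of files. The sketch below is a plain-Python model under that assumption, not the sos_targets API; the function accumulate and the dictionary keys are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def accumulate(substep_outputs: List[List[str]]) -> Dict[str, list]:
    """Model step_output: keep each substep's output as a group and
    also expose all targets as one flat list."""
    flat = [target for group in substep_outputs for target in group]
    return {"groups": substep_outputs, "targets": flat}


# Two substeps, each producing a single file.
step_output = accumulate([["a.bak"], ["b.bak"]])

# The flat view corresponds to what the next step sees as step_input.
print(step_output["targets"])

# Each group corresponds to the _input of one substep in the next step.
for group in step_output["groups"]:
    print(group)
```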

SoS substeps must produce different sets of _output. The following workflow will fail to execute because both substeps will attempt to produce a.bak.

RuntimeError: Failed to process step output ('a.bak'): Output a.bak from substep 1 of 2 substeps overlaps with output from a previous substep.
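The rule behind this error can be sketched as a check that no target appears in the output of more than one substep. This is a plain-Python illustration of the idea only, not how SoS implements it; check_no_overlap is a hypothetical function and its message is merely modeled on the error above.

```python
from typing import List


def check_no_overlap(substep_outputs: List[List[str]]) -> None:
    """Raise RuntimeError if a target is produced by more than one substep."""
    seen = set()
    for index, group in enumerate(substep_outputs):
        duplicated = seen.intersection(group)
        if duplicated:
            raise RuntimeError(
                f"Output {sorted(duplicated)} from substep {index} overlaps "
                "with output from a previous substep."
            )
        seen.update(group)


# Distinct outputs per substep: passes silently.
check_no_overlap([["a.bak"], ["b.bak"]])

try:
    # Both substeps claim a.bak, so the check fails.
    check_no_overlap([["a.bak"], ["a.bak"]])
except RuntimeError as err:
    print(err)
```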

Output with predefined groups (option `group_by`)

In situations where you have predefined input and output pairs, you can define output targets with groups using option group_by. The key here is that the number of groups should match the number of substeps. Technically speaking, the output statement defines step_output and each substep takes one group as its _output.

For example,

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
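The pairing of predefined output groups with substeps can be pictured as matching the i-th input group with the i-th output group. The sketch below is a plain-Python model of that idea, not SoS code; pair_groups is a hypothetical name, and raising ValueError on a mismatch is a choice of this sketch (SoS may report the problem differently).

```python
from typing import List, Tuple

Group = List[str]


def pair_groups(
    input_groups: List[Group], output_groups: List[Group]
) -> List[Tuple[Group, Group]]:
    """Pair the i-th input group with the i-th output group."""
    if len(input_groups) != len(output_groups):
        # Assumption of this sketch: a mismatch is treated as an error.
        raise ValueError(
            f"{len(output_groups)} output groups for "
            f"{len(input_groups)} substeps"
        )
    return list(zip(input_groups, output_groups))


pairs = pair_groups([["a.txt"], ["b.txt"]], [["a.bak"], ["b.bak"]])
for substep_input, substep_output in pairs:
    print(f"input {substep_input} -> output {substep_output}")
```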

Named output

Similar to named input, you can assign labels to output files and refer to them with _output["label"].

Output with label A is a.txt, with label B is b.txt
Output of step is a.txt b.txt

More importantly, these labels define named output that can be referred to with function named_output.

Input of step is a.txt

Attach variables to individual output files

The paired_with option can be used to attach variables to output files.

Output of substep is a.txt b.txt, with sample names A and B

Attach variables to output

Option group_with can be used to attach variables to output groups, which can be useful as annotations for output files when the output is passed to other steps.

A potentially confusing part of the group_with option is that it assigns elements to either _output or step_output, depending on how the output statement is defined. If the output statement has neither the group_by nor the for_each option, it defines a single _output, and group_with should assign a single element to the _output of this specific substep:

Output of substep is out_A.txt, with sample name A
Output of substep is out_B.txt, with sample name B

If you would like to attach some result to an individual substep, though, it can be easier to set the variable on _output directly.

seed of output out_0.txt is 577
seed of output out_1.txt is 209
