Output from another step

Difficulty level: intermediate
Time needed to learn: 10 minutes or less
Key points:
- Function output_from(step) refers to output from another step
- output_from(step)[name] can be used to refer to named output from step

Referring to named output from another step

As shown in the example from the tutorial How to use named output in data-flow style workflows, the function named_output can be used to refer to named output from another step:

xlsx2csv data/DEG.xlsx > data/DEG.csv

One obvious limitation of named_output() is that the name has to be unique in the workflow. For example, in the following script, where another step test_csv also names its output csv, the workflow fails due to ambiguity. This is usually not a concern with small workflows. However, as workflows grow more complex, it is sometimes desirable to anchor named output more precisely.

ERROR: Multiple steps convert, test_csv to generate target named_output("csv")

RuntimeError: Workflow exited with code 1

Function `output_from`

Function `output_from(steps, group_by, ...)`

Function output_from refers to the output of step. The returned object is the complete output from step with its own sources and groups. Therefore,

- More than one step can be specified as a list of step names
- Option group_by can be used to regroup the returned files
- output_from(step)[name] refers to all output with source name

Function output_from imports the output from one or more other steps. For example, in the following workflow output_from(['step_10', 'step_20']) takes the output from steps step_10 and step_20 as input.

[###] 3 steps processed (1 job completed, 2 jobs ignored)

The above example is a simple forward workflow with numerically indexed steps. In this case the parameters of output_from can be simplified to just the indexes (integers), so the workflow can be written as:

[###] 3 steps processed (1 job completed, 2 jobs ignored)

The source steps of output_from(steps) do not have to be numerically indexed steps. For example, the above example can be written as:

[###] 3 steps processed (3 jobs completed)

`labels` of outputs returned from `output_from`

The sources of the files returned from output_from() are by default the names of the steps, so you can refer to these files separately using the _input[name] syntax:

[###] 3 steps processed (1 job completed, 2 jobs ignored)

If the output has its own sources (names), the sources will be kept.

[###] 3 steps processed (3 jobs completed)

As usual, keyword arguments of the input statement override the sources of input files:

[###] 3 steps processed (2 jobs completed, 1 job ignored)
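The labeling rules described above can be pictured with a small plain-Python model. This is a sketch for illustration only, not SoS code and not how SoS implements sources: the helpers gather and select, the label name my_label, and the file names a.txt and b.txt are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A file paired with the label (source) it carries: (file name, label).
LabeledFile = Tuple[str, str]


def gather(step_outputs: Dict[str, List[str]],
           label: Optional[str] = None) -> List[LabeledFile]:
    """Collect files from several steps.

    By default each file is labeled with the name of the step it came from.
    Passing `label` models a keyword argument that overrides the sources.
    """
    return [(name, label or step)
            for step, files in step_outputs.items()
            for name in files]


def select(files: List[LabeledFile], label: str) -> List[str]:
    """Model of _input[label]: keep only the files carrying that label."""
    return [name for name, source in files if source == label]


# Hypothetical outputs of two upstream steps.
outputs = {"step_10": ["a.txt"], "step_20": ["b.txt"]}

by_step = gather(outputs)                       # default: labeled by step name
overridden = gather(outputs, label="my_label")  # override: one label for all

print(select(by_step, "step_10"))
print(select(overridden, "my_label"))
```

In this model the default labels let each step's files be picked out separately, while the override makes all files available under the single new label.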

Groups of output returned from `output_from`

Similar to the case with named_output, the returned object from output_from() keeps its original groups. For example,

[##] 2 steps processed (8 jobs ignored)

You can override the groups using the group_by option of output_from.

[##] 2 steps processed (6 jobs ignored)
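The effect of regrouping can be sketched in plain Python. This is a model, not SoS code: the helper regroup and the file names are hypothetical, and only the integer form of group_by (groups of a fixed number of files) is modelled.

```python
from typing import List


def regroup(groups: List[List[str]], size: int) -> List[List[str]]:
    """Flatten the existing groups and re-chunk them into groups of `size` files."""
    flat = [name for group in groups for name in group]
    return [flat[i:i + size] for i in range(0, len(flat), size)]


# Hypothetical output of an upstream step: four substeps, one file each.
original_groups = [["a_0.txt"], ["a_1.txt"], ["a_2.txt"], ["a_3.txt"]]

# Keeping the original groups would drive four downstream substeps,
# while regrouping by 2 gives two substeps with two files each.
regrouped = regroup(original_groups, 2)

print(len(original_groups), len(regrouped))
print(regrouped)
```

Under these assumptions the downstream step runs once per group, so the number of groups returned determines the number of substeps.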

Note that we used

_input.with_suffix('.bak')

when _input contains only one filename, in which case the statement is equivalent to

_input[0].with_suffix('.bak')

However, when _input contains more than one file, you will have to deal with them one by one as follows:

[x.with_suffix('.bak') for x in _input]
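The same pattern can be tried outside of a workflow with standard pathlib paths. The sketch below assumes that the elements of _input behave like pathlib.Path objects for this purpose; the list inputs and the file names are hypothetical stand-ins for a multi-file _input.

```python
from pathlib import Path

# Stand-in for a multi-file _input: a plain list of pathlib.Path objects
# (assumption: SoS file targets behave like pathlib.Path here).
inputs = [Path("a.txt"), Path("b.txt")]

# with_suffix replaces the last suffix of a single path.
single = inputs[0].with_suffix(".bak")

# For several files, apply it to each element in turn.
backups = [x.with_suffix(".bak") for x in inputs]

print(single)
print([str(p) for p in backups])
```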

Using `output_from` in place of `named_output`

Let us go back to our convert and plot example. When another step with the same named output is added, it is no longer possible to use named_output(name). In this case you can explicitly specify the step in which the named output is defined, and use

output_from(step)[name]

instead of

named_output(name)

as shown in the following example:

xlsx2csv data/DEG.xlsx > data/DEG.csv
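The difference between a name-only lookup and a step-anchored lookup can be modelled in a few lines of plain Python. This is an illustration of the idea, not SoS code or SoS's implementation: the dictionary named_outputs, the helpers lookup_by_name and lookup_by_step, and the file data/DEG_test.csv are hypothetical.

```python
from typing import Dict, List

# Hypothetical record of named outputs, keyed by step name, then output name.
# Only data/DEG.csv appears in the text; data/DEG_test.csv is made up.
named_outputs: Dict[str, Dict[str, List[str]]] = {
    "convert": {"csv": ["data/DEG.csv"]},
    "test_csv": {"csv": ["data/DEG_test.csv"]},
}


def lookup_by_name(name: str) -> List[str]:
    """Model of a name-only lookup: fails unless exactly one step defines `name`."""
    providers = [step for step, outputs in named_outputs.items() if name in outputs]
    if len(providers) != 1:
        raise RuntimeError(
            f"{len(providers)} steps provide named output {name!r}: {providers}")
    return named_outputs[providers[0]][name]


def lookup_by_step(step: str, name: str) -> List[str]:
    """Model of a step-anchored lookup: naming the step removes the ambiguity."""
    return named_outputs[step][name]


try:
    lookup_by_name("csv")
except RuntimeError as error:
    print("ambiguous:", error)

print(lookup_by_step("convert", "csv"))
```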

Note that output_from is better than named_output in its ability to refer to a specific step, but it is also worse than named_output for the same reason, because tying the reference to a step name makes the workflow more difficult to maintain. We generally recommend the use of named_output for its simplicity.

`output_from()` with skipped substeps

Function output_from() obtains the outputs, or more precisely the substep outputs, of another step. There is, however, a case in which a substep is skipped and leaves no output. In this case, the substep output is discarded.

For example, when a substep in the step A of the following workflow is skipped, the result from output_from('A') contains only the output of valid substeps.

[##] 2 steps processed (4 jobs completed, 3 jobs ignored)

However, if you would like to keep a consistent number of substeps across steps, you can get output from all substeps by using option remove_empty_groups=False.

[##] 2 steps processed (4 jobs completed, 3 jobs ignored)
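The two behaviors can be compared with a plain-Python model of the groups returned from a step. This is a sketch, not SoS code: the helper collect_groups and the file names are hypothetical, and the default of discarding empty groups is taken from the description above.

```python
from typing import List


def collect_groups(substep_outputs: List[List[str]],
                   remove_empty_groups: bool = True) -> List[List[str]]:
    """Return one group per substep, optionally dropping substeps with no output."""
    if remove_empty_groups:
        return [list(group) for group in substep_outputs if group]
    return [list(group) for group in substep_outputs]


# Hypothetical step with four substeps, the second of which was skipped
# and therefore left no output.
outputs_of_A = [["a_0.txt"], [], ["a_2.txt"], ["a_3.txt"]]

valid_only = collect_groups(outputs_of_A)
all_substeps = collect_groups(outputs_of_A, remove_empty_groups=False)

print(len(valid_only), valid_only)
print(len(all_substeps), all_substeps)
```

Keeping the empty group preserves the one-to-one correspondence between substeps of the two steps, at the cost of a downstream substep that receives no input.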

Output from a workflow

Function `output_from(workflow_name)`

output_from(workflow_name) is equivalent to output_from(workflow_name_index), where index is the largest index of the workflow workflow_name.

Function output_from is usually used to refer to the output of a specific step. However, similar to the target sos_step, which can refer to a numerically indexed workflow, output_from can also accept the name of a workflow and return the output of the last step of that workflow.

For example, in the following workflow, output_from('A') is used to obtain the output of step A_2, which is the last step of the workflow A. Although output_from('A') is identical to output_from('A_2'), it frees you from specifying the index of the last step of the workflow, and it is more intuitive to think of output_from('A') as the output of the workflow.

[###] 3 steps processed (2 jobs completed, 1 job ignored)
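The rule that a workflow name stands for its step with the largest index can be written down as a short plain-Python sketch. It only models the naming rule stated above and is not how SoS resolves steps internally; the helper last_step_of and the step list are hypothetical.

```python
import re
from typing import Iterable


def last_step_of(workflow: str, step_names: Iterable[str]) -> str:
    """Return the step named `<workflow>_<index>` that has the largest index."""
    pattern = re.compile(rf"{re.escape(workflow)}_(\d+)")
    indexed = []
    for name in step_names:
        match = pattern.fullmatch(name)
        if match:
            # Compare indexes as integers so that A_10 ranks above A_2.
            indexed.append((int(match.group(1)), name))
    if not indexed:
        raise ValueError(f"no numerically indexed step found for workflow {workflow!r}")
    return max(indexed)[1]


# Hypothetical list of steps defined in a script.
steps = ["A_1", "A_2", "default"]

print(last_step_of("A", steps))
```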
