- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - A forward-style workflow consists of numerically numbered steps
  - Multiple workflows can be defined in a single SoS script or notebook
  - Optional `input` and `output` statements can be added to change how workflows are executed
  - The default input of a step is the output of its previous step
The workflows you have seen so far have numerically numbered steps. For example, the example from the first tutorial has a single workflow `plot` with steps `plot_10` and `plot_20`, and SoS will execute the steps in numeric order.
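A sketch of what that example looks like is shown below; the `xlsx2csv` conversion, the R plotting code, and the data column names are illustrative assumptions, not a verbatim copy of the tutorial:

```
[global]
# file names assumed for this sketch
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
# step 10 of workflow "plot": convert the Excel file to csv
# (assumes the xlsx2csv utility is installed)
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot_20]
# step 20 of workflow "plot": plot the csv file with R
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
```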
Workflows with numerically numbered steps
- Steps have the format of `name_index` (e.g. `step_10`), where `name` is the name of the workflow and `index` is a number
- Step indexes are usually not consecutive, to allow easy insertion of new steps
- Both workflow `name` and `index` can be omitted: `10`, `20` etc. are considered steps of an unnamed workflow, and `step` is considered a one-step workflow
- Workflows can be executed by workflow name (e.g. `%run name` in Jupyter, `sos run name` from the command line). A default workflow will be executed if only one workflow is defined, or if a default workflow is defined
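For example, a script with only bare numeric indexes defines a two-step unnamed workflow (a minimal sketch):

```
[10]
# first step of the default (unnamed) workflow
sh:
    echo "step 10"

[20]
# second step of the default (unnamed) workflow
sh:
    echo "step 20"
```

Because it is the only workflow in the script, it can be executed with a plain `%run` in SoS Notebook or `sos run script.sos` from the command line, without specifying a workflow name.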
The workflow is executed by default with magic `%run` because only one workflow is defined in the script. You can also define multiple workflows and execute them by their names. For example, the following script defines two single-step workflows `convert` and `plot`. Because there is no default workflow, you have to refer to them by their names: `%run convert` and `%run plot`.
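A sketch of such a script, reusing the assumed file names and commands from the earlier sketch:

```
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
# single-step workflow "convert": produce the csv file
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot]
# single-step workflow "plot": plot the csv file
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
```

Here `%run convert` executes only the `convert` workflow, and `%run plot` only the `plot` workflow.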
As shown in How to specify input and output files and process input files in groups, you can define `input` and `output` for each step.
Default input of numerically numbered workflows
The default input of a step in a numerically numbered workflow is the output of its previous step
Therefore, in the following workflow, the `input` statement of `plot_20` can be omitted.
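A sketch of this situation, with the same assumed file names: `plot_10` declares its `output`, and `plot_20` has no `input` statement, so it receives `plot_10`'s output by default.

```
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
# declare the output so that SoS knows what this step produces
output: csv_file
run: expand=True
    xlsx2csv {excel_file} > {_output}

[plot_20]
# no input statement: the default input is the output of plot_10
output: figure_file
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
```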
We have now shown the same workflow in the `plot_10`, `plot_20` style and in the `convert` and `plot` style, with and without specification of input and output. What will happen if you define a workflow in separate steps with `input` and `output` statements?
Let us first remove the intermediate `DEG.csv`, and execute the `plot` step of the following workflow.
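A sketch of such a workflow, in the `convert` and `plot` style with explicit `input` and `output` statements (same assumed file names and commands):

```
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv_file
run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: csv_file
output: figure_file
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
```

With `DEG.csv` removed, `%run plot` cannot find the input of the `plot` step, which is what triggers the data-flow behavior described below.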
Simple data-flow based workflow
If the `input` files of a step do not exist, SoS will automatically check other steps in the workflow and call them to generate the needed files. This allows the creation of workflows based on data flow.
As you can see, although only the step `plot` is requested, SoS executes both the `convert` and `plot` steps because the required input file `csv_file` (`DEG.csv`) does not exist. In this case, SoS looks for a step that produces `DEG.csv` and executes it to generate `DEG.csv` before `plot` is executed.
Output of data-flow based workflow
For output files to be automatically identified by SoS as input for another step, the `output` statement must be clearly defined. That is to say, it must be either

- one or more filenames (e.g. `output: "DEG.csv"`), or
- an expression that can be easily evaluated from variables defined in the global section (e.g. `output: csv_file`)
Consequently, output derived from `_input` cannot be used (e.g. `output: _input.with_suffix('.bak')`). However,

- you can assign a name to complex output and use `named_output()` to refer to it, or
- you can create makefile-style steps that allow the creation of files through pattern matching.
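A sketch of the `named_output()` approach, applied to the same `convert` and `plot` scenario (file names and commands are the same assumptions as before):

```
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
# label the output "csv" so that other steps can request it by name
output: csv = csv_file
run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
# request the named output of the convert step instead of a file name
input: named_output('csv')
output: figure_file
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
```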
Output of a step with substeps
If a step has multiple substeps, the step output consists of the `_output` of each substep, which will by default be passed to the next step and create multiple substeps there.
Things can get a little bit complicated when a step has multiple substeps. As you may recall from How to specify input and output files and process input files in groups, multiple substeps can be defined with the input option `group_by`, each with its own `_input` and `_output`. When the output of such a step is inherited by another step, these `_output` become the `_input` of its substeps.
For example, after running `fastqc` on the input fastq files, we would like to process the generated HTML files and check if the qualities are ok. We use the `beautifulsoup` Python module to find all the `<h2>` headers. Without going into the details of using `beautifulsoup` to parse HTML files (a sketch of such a workflow is given after the following list), you should notice that
- No `input` is defined for step `20`, so it takes the output of step `10` as its input.
- The output of step `10` contains two groups, `data/S20_R1_fastqc.html` and `data/S20_R2_fastqc.html`, which become the input of two substeps of step `20`.
- The inputs of step `20` are processed one by one.
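A sketch of a workflow matching these observations; the fastq file names follow the HTML report names above, while the `fastqc` invocation and the parsing code are illustrative assumptions:

```
[10]
# run fastqc on each fastq file separately, creating two substeps
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1
output: f'{_input:n}_fastqc.html'
sh: expand=True
    fastqc {_input}

[20]
# no input statement: each HTML report from step 10 becomes the
# _input of one substep of this step
python: expand=True
    from bs4 import BeautifulSoup

    with open("{_input}") as html:
        soup = BeautifulSoup(html, "html.parser")
    # print the <h2> headers, which summarize the quality checks
    for h2 in soup.find_all("h2"):
        print("{_input}:", h2.text)
```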
If you would like to re-group the default input, you can redefine the `input` explicitly, or apply the option `group_by` to the default input:
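For example, in the fastqc sketch above, step `20` could regroup the two HTML reports into a single substep by applying `group_by` to its default input (again a sketch):

```
[20]
# no positional input: group_by applies to the default input,
# i.e. the output of step 10, regrouped into one group of two files
input: group_by='all'
python: expand=True
    # _input now contains both HTML reports
    print("Reports to check: {_input}")
```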