- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
- A forward-style workflow consists of numerically numbered steps
- Multiple workflows can be defined in a single SoS script or notebook
- Optional input and output statements can be added to change how workflows are executed
- The default input of a step is the output of its previous step
The workflows you have seen so far have numerically numbered steps. For instance, the example from the first tutorial has a single workflow `plot` with steps `plot_10` and `plot_20`, and SoS executes the steps in numeric order.
Workflows with numerically numbered steps
- Steps have the format `name_index` (e.g. `step_10`), where `name` is the name of the workflow and `index` is a number
- Step indexes are usually not consecutive, to allow easy insertion of new steps
- Both the workflow `name` and the `index` can be omitted. `10`, `20`, etc. are considered steps of an unnamed workflow, and `step` is considered a one-step workflow
- A workflow can be executed by its name (e.g. `%run name` in Jupyter, `sos run name` from the command line). A default workflow will be executed if only one workflow is defined or if a default workflow is defined
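The naming rules above can be pictured with a small sketch in plain Python. This is only a toy parser written for illustration, not SoS's own implementation: the helpers `parse_step_name` and `steps_of` are hypothetical, and the unnamed workflow is represented by `None` purely as a convention of this sketch.

```python
def parse_step_name(header):
    """Split a step header such as 'plot_10' into (workflow name, index)."""
    if header.isdigit():
        # '10', '20', ...: a step of the unnamed workflow (None in this sketch)
        return None, int(header)
    name, sep, tail = header.rpartition("_")
    if sep and name and tail.isdigit():
        # 'name_index' form, e.g. 'plot_10'
        return name, int(tail)
    # no numeric suffix: a one-step workflow such as 'convert'
    return header, None


def steps_of(workflow, headers):
    """Return the headers that belong to `workflow`, in numeric order."""
    selected = []
    for header in headers:
        name, index = parse_step_name(header)
        if name == workflow:
            # a step without an index sorts first (an assumption of this sketch)
            selected.append((-1 if index is None else index, header))
    return [header for _, header in sorted(selected)]


print(steps_of("plot", ["plot_20", "convert", "plot_10"]))
print(steps_of("convert", ["plot_20", "convert", "plot_10"]))
```

The first call lists `plot_10` before `plot_20` regardless of the order in which the steps were written, which is the behavior the numeric indexes are meant to provide.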
The workflow is executed by default with magic %run because only one workflow is defined in the script. You can also define multiple workflows and execute them by their names. For example, the following script defines two single-step workflows convert and plot. Because there is no default workflow, you will have to refer to them with their names:
%run convert
and
%run plot
As shown in How to specify input and output files and process input files in groups, you can define input and output for each step.
Default input of numerically numbered workflows
The default input of a step in a numerically numbered workflow is the output of its previous step.
Therefore, in the following workflow, the input statement of `plot_20` can be omitted.
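As a rough illustration of this rule, the following toy model in plain Python resolves the input of each step. It is not SoS code and not how SoS is implemented; the function `effective_inputs` and the file names `data/DEG.xlsx` and `output.pdf` are assumptions made for the example.

```python
def effective_inputs(steps):
    """Resolve the input of each step of a numerically ordered workflow.

    `steps` is a list of dicts ordered by step index; each dict has a
    'name', an 'output' list, and optionally an explicit 'input' list.
    """
    resolved = {}
    previous_output = []  # the first step has no previous step to inherit from
    for step in steps:
        # an explicit input wins; otherwise inherit the previous step's output
        resolved[step["name"]] = step.get("input", previous_output)
        previous_output = step["output"]
    return resolved


workflow = [
    {"name": "plot_10", "input": ["data/DEG.xlsx"], "output": ["DEG.csv"]},
    {"name": "plot_20", "output": ["output.pdf"]},  # no input statement
]
print(effective_inputs(workflow)["plot_20"])
```

Because `plot_20` declares no input, the model hands it `DEG.csv`, the output of `plot_10`.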
We have shown the same workflows in the plot_10, plot_20 style, in the convert and plot style, and with and without specification of input and output. What will happen if you define a workflow in separate steps with input and output statements?
Let us first remove the intermediate DEG.csv,
and execute the plot step of the following workflow
Simple data-flow based workflow
If the input files of a step do not exist, SoS will automatically check other steps in the workflow and call them to generate the needed files. This allows the creation of workflows based on data flow.
As you can see, although only the step `plot` is requested, SoS executes both the `convert` and `plot` steps because the required input file `csv_file` (`DEG.csv`) does not exist. In this case, SoS looks for a step that produces `DEG.csv` and executes it to generate the file before `plot` is executed.
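The lookup described here can be sketched as a small dependency resolver in plain Python. This is a simplified model under stated assumptions, not SoS's actual algorithm: the function `plan` is hypothetical, the file names other than `DEG.csv` are made up for the example, and circular dependencies are assumed not to occur.

```python
def plan(target, steps, existing):
    """Return the step names to run, in order, so that `target` can execute.

    `steps` maps a step name to {'input': [...], 'output': [...]};
    `existing` is the set of files that are already available.
    """
    order = []

    def visit(name):
        if name in order:
            return
        for path in steps[name]["input"]:
            if path in existing:
                continue  # nothing to do, the file is already there
            # look for a step whose declared output contains the missing file
            providers = [s for s, spec in steps.items() if path in spec["output"]]
            if not providers:
                raise FileNotFoundError(f"no step generates {path}")
            visit(providers[0])
        order.append(name)

    visit(target)
    return order


steps = {
    "convert": {"input": ["data/DEG.xlsx"], "output": ["DEG.csv"]},
    "plot": {"input": ["DEG.csv"], "output": ["output.pdf"]},
}
print(plan("plot", steps, existing={"data/DEG.xlsx"}))             # DEG.csv is missing
print(plan("plot", steps, existing={"data/DEG.xlsx", "DEG.csv"}))  # DEG.csv is present
```

When `DEG.csv` is missing, the plan contains `convert` followed by `plot`; when it is present, only `plot` is scheduled.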
Output of data-flow based workflow
For output files to be automatically identified by SoS as input for another step, the output statement must be clearly defined. That is to say, the output must be either
- One or more filenames (e.g. `output: "DEG.csv"`), or
- Some expression that can be easily evaluated from variables defined in the global section (e.g. `output: csv_file`)
So output derived from `_input` cannot be used (e.g. `output: _input.with_suffix('.bak')`). However,
- You can assign complex output a name and use `named_output()` to refer to it.
- You can create makefile-style steps that allow the creation of files through pattern-matching.
Output of a step with substeps
If a step has multiple substeps, the step output consists of the `_output` from each substep. By default these are passed to the next step, where each of them creates a substep.
Things can get a little bit complicated when a step has multiple substeps. As you can recall from How to specify input and output files and process input files in groups, multiple substeps can be defined by input option group_by, each with its own _input and _output. When the output of such a step is inherited by another step, these _output will become the _input of the substeps.
For example, after running fastqc on the input fastq files, we would like to process the generated HTML file and check if the qualities are ok. We use the beautifulsoup Python module and find all the <h2> headers. Without going into the details of the use of beautifulsoup to parse HTML files, you should notice that
- No `input` is defined for step `20`, so it takes the output of step `10` as its input.
- The output of step `10` contains two groups, `data/S20_R1_fastqc.html` and `data/S20_R2_fastqc.html`, which become the input of two substeps of step `20`.
- The inputs of step `20` are processed one by one.
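The hand-over of groups between the two steps can be modeled in a few lines of plain Python. This is a toy model rather than SoS code: `run_step` is a hypothetical helper, the fastq file names are assumed for the example, and the lambdas stand in for the real work of running fastqc and parsing the HTML.

```python
def run_step(input_groups, make_output):
    """Run one substep per input group and collect each substep's _output."""
    return [make_output(_input) for _input in input_groups]


# step 10: one substep per fastq file (the fastq names are hypothetical)
step_10_inputs = [["data/S20_R1.fastq"], ["data/S20_R2.fastq"]]
step_10_outputs = run_step(
    step_10_inputs,
    lambda _input: [_input[0].replace(".fastq", "_fastqc.html")],
)

# step 20 has no input of its own, so it inherits the groups from step 10
step_20_outputs = run_step(step_10_outputs, lambda _input: [f"checked {_input[0]}"])
print(step_10_outputs)
print(step_20_outputs)
```

Step `20` ends up with two substeps, one for each `_output` group produced by step `10`.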
If you would like to re-group the default input, you can redefine the input explicitly, or apply option group_by to the default input:
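The effect of regrouping can be pictured with another small Python sketch. It only imitates two forms of the `group_by` option, an integer group size and `'all'`, and the helper `regroup` is hypothetical; it is meant as an illustration of the idea, not as a description of SoS internals.

```python
def regroup(groups, group_by):
    """Flatten the inherited groups and split them again.

    `group_by` is a positive integer (files per group) or the string 'all'.
    """
    files = [path for group in groups for path in group]
    if group_by == "all":
        return [files]
    return [files[i:i + group_by] for i in range(0, len(files), group_by)]


inherited = [["data/S20_R1_fastqc.html"], ["data/S20_R2_fastqc.html"]]
print(regroup(inherited, "all"))  # one substep that sees both files
print(regroup(inherited, 1))      # two substeps, one file each
```

With `'all'`, the two inherited groups are merged so that a single substep processes both HTML files together.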