- Difficulty level: intermediate
- Time needed to learn: 20 minutes or less
- Key points:
- Option `provides` extends the "data-flow" style of workflow by allowing steps to generate different outputs.
- Steps with the `provides` section option have a default `step_output`. `input` can be derived from pattern-matched variables.
- Option
Auxiliary steps are special steps that are executed to provide targets that are required by others.
For example, when the following step is executed with an input file `bamfile` (with extension `.bam`), it checks the existence of the input file (`bamfile`) and of a dependent index file (with extension `.bam.bai`).
[100 (call variant)]
input: bamfile
depends: bamfile + '.bai'
run:
    # commands to call variants from
    # input bam file
Because the step depends on an index file, SoS will look in the script for a step that provides such a target, which would be similar to
[index_bam : provides='{sample}.bam.bai']
input: f"{sample}.bam"
run: expand=True
    samtools index {_input}
Such a step is characterized by a `provides` option (or by a simple `output` statement) and is called an auxiliary step. In this particular case, if `bamfile="AS123.bam"`, the requested dependent file would be `AS123.bam.bai`. Through the matching mechanism of option `provides`, the `index_bam` step would be executed with variables `sample="AS123"` and `step_output="AS123.bam.bai"`.
Unless option `-T` (see tracing dependency for details) is specified, SoS will not check whether a dependent target can be generated by an auxiliary step if the target already exists. In an extreme case, `sos run -t filename` will quit directly if `filename` already exists.
An auxiliary step can trigger other auxiliary steps, and together they form a DAG (Directed Acyclic Graph). In fact, you can write workflows in a makefile style using only auxiliary steps and execute workflows defined by targets. If you are familiar with Makefile, and especially snakemake, it may be natural for you to implement your workflow in this style. The advantage of SoS is that you can use either or both forward-style and makefile-style steps to define your workflow and take advantage of both approaches.
An auxiliary step is defined by the `provides` option in the section header, in the format of
[step_name : provides=target]
where `target` can be
- A filename or file pattern such as `"{sample}.bam.idx"`
- Other types of targets such as `executable("ms")`
- A list (sequence) of one or more file patterns and targets.
A file pattern is a filename with optional patterns whose variable names are enclosed in `{ }`. SoS matches filenames against the patterns and, if a match succeeds, assigns the matched parts of the names to variables. For example,
[compress: provides = '{filename}.bam']
would be triggered by targets `sample_A.bam` and `sample_B.bam`. When the step is triggered by `sample_A.bam`, it defines variable `filename` as `sample_A` and sets the output of the step to `sample_A.bam`.
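The matching described above can be illustrated with a small Python sketch. It is not SoS's actual implementation: the helper `match_pattern` is a hypothetical name, and the assumption that each `{name}` placeholder greedily matches one or more characters is made only for illustration.

```python
import re

def match_pattern(pattern, target):
    """Return variables captured when target matches a '{name}'-style pattern, else None.

    Hypothetical helper for illustration; each placeholder is assumed to
    match one or more arbitrary characters.
    """
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text, matched verbatim
        else:
            regex += f"(?P<{part}>.+)"      # placeholder becomes a named group
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

print(match_pattern("{filename}.bam", "sample_A.bam"))     # {'filename': 'sample_A'}
print(match_pattern("{sample}.bam.bai", "AS123.bam.bai"))  # {'sample': 'AS123'}
print(match_pattern("{filename}.bam", "notes.txt"))        # None
```

A successful match yields both the pattern variables and, implicitly, the step output, which is simply the target that was matched.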
The following example removes all local `*.bam` and `*.bam.bai` files before it executes three workflows defined by targets. We use magic `%run` to execute it, which is equivalent to executing it from the command line using commands such as
sos run myscript -t TS1.bam
Let us create a workflow with two auxiliary steps, `compress` and `index`. The `compress` step generates a `.bam` file (no input here for simplicity) and the `index` step creates a `.bam.bai` file from the `.bam` file.
If we only want to generate a `.bam` file (with option `-t TS1.bam`), only the `compress` step is executed.
If we would like to generate both `.bam` and `.bam.bai` files (with option `-t TS2.bam.bai`), both steps are executed.
As you can see from the output, when the first workflow is executed with target `TS1.bam`, step `compress` is executed to produce it. In the second run, both steps are executed to generate `TS2.bam` and then `TS2.bam.bai`.
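The way a requested target pulls in a chain of auxiliary steps can be sketched in plain Python. This is a simplified model, not how SoS builds its DAG: the `RULES` table, `match_pattern`, and `plan` are hypothetical names, and the rules merely mirror the `compress` and `index` steps described above.

```python
import re

# Hypothetical rule table: step name -> (provides pattern, input template or None).
RULES = {
    "compress": ("{sample}.bam", None),
    "index": ("{sample}.bam.bai", "{sample}.bam"),
}

def match_pattern(pattern, target):
    """Match target against a '{name}'-style pattern; return captured variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

def plan(target, existing=frozenset(), order=None):
    """Return the (step, output) pairs needed to build target, dependencies first."""
    order = [] if order is None else order
    # Targets that already exist (or are already planned) trigger nothing.
    if target in existing or any(out == target for _, out in order):
        return order
    for step, (provides, input_template) in RULES.items():
        variables = match_pattern(provides, target)
        if variables is None:
            continue
        if input_template is not None:
            # The step's input becomes a new target that may trigger another rule.
            plan(input_template.format(**variables), existing, order)
        order.append((step, target))
        return order
    raise ValueError(f"no step provides {target}")

print(plan("TS1.bam"))      # [('compress', 'TS1.bam')]
print(plan("TS2.bam.bai"))  # [('compress', 'TS2.bam'), ('index', 'TS2.bam.bai')]
print(plan("TS2.bam.bai", existing={"TS2.bam"}))  # [('index', 'TS2.bam.bai')]
```

The last call models the case where `TS2.bam` already exists, so only the indexing step is scheduled.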
In addition to output files, an auxiliary step can provide targets of other types. One of the most widely used targets is `sos_variable`, which provides variables that can be accessed by later steps. For example,
However, for this particular example, it is more straightforward to return the variable with option `shared` as follows:
You can specify multiple targets to the `provides` option. A step would be triggered if any of the targets matches.
For example, the `temp` step is triggered twice in the following example, the first time by target `text.bak` and the second time by target `text.tmp`.
However, depending on how the auxiliary step is designed, it might generate multiple outputs at the same time, and it would be wasteful to execute the step multiple times. In this case, you can define an `output` statement to let SoS know that one execution of the step generates multiple targets.
Technically speaking, the `provides` option generates a default `step_output`, which is the matched filename, that is, a single file `text.bak` or `text.tmp`, when SoS tries to find a step to generate it. With an explicit `output` statement, either `text.bak` or `text.tmp` will lead to `filename='text'` and a `step_output` of both `text.bak` and `text.tmp`.
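The difference between the default `step_output` and an explicit `output` statement can be modeled with a short Python sketch. It only mimics the bookkeeping described above and is not SoS code: `PROVIDES`, `match_pattern`, `step_output`, and `runs_needed` are hypothetical names, and the two patterns are assumed to be `{filename}.bak` and `{filename}.tmp`.

```python
import re

# Assumed patterns of a hypothetical step that provides two kinds of files.
PROVIDES = ["{filename}.bak", "{filename}.tmp"]

def match_pattern(pattern, target):
    """Match target against a '{name}'-style pattern; return captured variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

def step_output(target, output_templates=None):
    """Return (variables, outputs) for the step when it is triggered by target."""
    for pattern in PROVIDES:
        variables = match_pattern(pattern, target)
        if variables is not None:
            if output_templates is None:
                return variables, [target]   # default: only the matched file
            # Explicit output statement: expand every template with the variables.
            return variables, [t.format(**variables) for t in output_templates]
    return None

def runs_needed(targets, output_templates=None):
    """Count how many times the step must run to produce all targets."""
    produced, runs = set(), 0
    for target in targets:
        if target in produced:
            continue                          # already generated by an earlier run
        _, outputs = step_output(target, output_templates)
        produced.update(outputs)
        runs += 1
    return runs

both = ["{filename}.bak", "{filename}.tmp"]
print(step_output("text.bak"))                      # ({'filename': 'text'}, ['text.bak'])
print(step_output("text.tmp", both))                # ({'filename': 'text'}, ['text.bak', 'text.tmp'])
print(runs_needed(["text.bak", "text.tmp"]))        # 2
print(runs_needed(["text.bak", "text.tmp"], both))  # 1
```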