- Difficulty level: intermediate
- Time needed to learn: 20 minutes or less
- Key points:
  - Option provides extends the "data-flow" style of workflow and allows a step to generate different outputs.
  - A step with the provides section option has a default step_output, and its input can be derived from pattern-matched variables.
 
Auxiliary steps are special steps that are executed to provide targets that are required by others.
For example, when the following step is executed with an input file bamfile (with extension .bam), it checks the existence of both the input file (bamfile) and a dependent index file (with extension .bam.bai).
[100 (call variant)]
input:   bamfile
depends: bamfile + '.bai'
run:
    # commands to call variants from 
    # input bam file
Because the step depends on an index file, SoS will look in the script for a step that provides such a target, which would be similar to
[index_bam : provides='{sample}.bam.bai']
input: f"{sample}.bam"
run: expand=True
     samtools index {_input}
Such a step is characterized by a provides option (or a simple output statement) and is called an auxiliary step. In this particular case, if bamfile="AS123.bam", the requested dependent file would be AS123.bam.bai. Through the matching mechanism of option provides, the index_bam step would be executed with variables sample="AS123" and step_output="AS123.bam.bai".
If a dependent target already exists, SoS does not check whether an auxiliary step could generate it unless option -T (see tracing dependency for details) is specified. In an extreme case, sos run -t filename will quit immediately if filename already exists.
An auxiliary step can trigger other auxiliary steps, and together they form a DAG (Directed Acyclic Graph). In fact, you can write a workflow in makefile style using only auxiliary steps and execute workflows defined by targets. If you are familiar with Makefile, and especially with snakemake, it can be natural to implement your workflow in this style. The advantage of SoS is that you can use forward-style steps, makefile-style steps, or both to define your workflow and take advantage of both approaches.
An auxiliary step is defined by the provides option in the section header, in the format of
[step_name : provides=target]
where target can be
- A filename or file pattern such as "{sample}.bam.idx"
- Other types of targets such as executable("ms")
- A list (sequence) of one or more file patterns and targets.
A file pattern is a filename with optional variable names enclosed in { }. SoS matches filenames against the pattern and, if the match succeeds, assigns the matched parts of the name to the variables. For example,
[compress: provides = '{filename}.bam']
would be triggered by targets such as sample_A.bam and sample_B.bam. When the step is triggered by sample_A.bam, it defines variable filename as sample_A and sets the output of the step to sample_A.bam.
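The matching described above can be pictured with a short Python sketch. This is not how SoS implements it. The helper names pattern_to_regex and match_target are hypothetical, and the sketch assumes that each {name} stands for one or more arbitrary characters and appears only once in a pattern.

```python
import re


def pattern_to_regex(pattern):
    """Translate a file pattern such as '{sample}.bam.bai' into a compiled regex."""
    # re.split with one capture group alternates literal text and variable names:
    # '{sample}.bam.bai' -> ['', 'sample', '.bam.bai']
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text, dots escaped
        else:
            regex += f"(?P<{part}>.+)"      # named group for the variable
    return re.compile(regex)


def match_target(pattern, target):
    """Return the variables defined by a successful match, or None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    if m is None:
        return None
    variables = m.groupdict()
    # by default the matched target itself becomes the step output
    variables["step_output"] = target
    return variables


print(match_target("{filename}.bam", "sample_A.bam"))
print(match_target("{sample}.bam.bai", "AS123.bam.bai"))
print(match_target("{filename}.bam", "notes.txt"))
```

A target that does not fit the pattern, such as notes.txt here, yields None, which corresponds to the step not being triggered.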
The following example removes all local *.bam and *.bam.bai files before it executes three workflows defined by targets. We use the %run magic to execute it, which is equivalent to executing it from the command line using commands such as
sos run myscript -t TS1.bam
Let us create a workflow with two auxiliary steps, compress and index. The compress step generates a .bam file (no input here for simplicity) and the index step creates a .bam.bai file from the .bam file.
If we only want to generate a .bam file (with option -t TS1.bam), only the compress step is executed.
If we would like to generate both .bam and .bam.bai files (with option -t TS2.bam.bai), both steps are executed.
As you can see from the output, when the first workflow is executed with target TS1.bam, step compress is executed to produce it. In the run with target TS2.bam.bai, both steps are executed to generate TS2.bam and then TS2.bam.bai.
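The way a requested target pulls in the steps it needs can be mimicked in plain Python. The sketch below is only an illustration of the idea, not SoS code: the RULES table, match, and resolve are hypothetical names, and the existing argument stands in for files already on disk (which, as noted earlier, are not regenerated unless -T is used).

```python
import re

# Hypothetical rule table: (step name, provides pattern, input pattern or None)
RULES = [
    ("compress", "{filename}.bam", None),
    ("index", "{filename}.bam.bai", "{filename}.bam"),
]


def match(pattern, target):
    """Match target against a '{name}' style pattern; return variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def resolve(target, existing=frozenset(), plan=None):
    """Return the ordered (step, target) pairs needed to produce target."""
    if plan is None:
        plan = []
    if target in existing or any(t == target for _, t in plan):
        return plan  # already available or already scheduled
    for name, provides, input_pattern in RULES:
        variables = match(provides, target)
        if variables is None:
            continue
        if input_pattern is not None:
            # the step's input becomes a new target that must be resolved first
            resolve(input_pattern.format(**variables), existing, plan)
        plan.append((name, target))
        return plan
    raise ValueError(f"no step provides {target}")


print(resolve("TS1.bam"))
print(resolve("TS2.bam.bai"))
print(resolve("TS2.bam.bai", existing={"TS2.bam"}))
```

Requesting TS1.bam schedules only compress, whereas requesting TS2.bam.bai schedules compress and then index. If TS2.bam is treated as already existing, only index is scheduled.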
In addition to output files, an auxiliary step can provide targets of other types. A widely used one is sos_variable, which provides variables that can be accessed by later steps. For example,
However, for this particular example, it is more straightforward to return the variable with option shared as follows:
You can specify multiple targets to the provides option. A step would be triggered if any of the targets matches.
For example, the temp step is triggered twice in the following example, first time by target text.bak and the second time by target text.tmp.
However, depending on how the auxiliary step is designed, it might generate multiple outputs at the same time, and it would be wasteful to execute the step multiple times. In this case, you can define an output statement to let SoS know that a single execution of the step generates multiple targets.
Technically speaking, the provides option generates a default step_output, which is the matched filename, that is, the single file text.bak or text.tmp that SoS is trying to generate. With an explicit output statement, either text.bak or text.tmp leads to filename='text' and a step_output that contains both text.bak and text.tmp.
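The difference between the default step_output and an explicit output statement can be sketched in Python as follows. This is a simplified model under stated assumptions rather than SoS's own logic: trigger and count_runs are hypothetical helpers, the provides list is assumed to be ['{filename}.bak', '{filename}.tmp'], and the optional output argument plays the role of the output statement.

```python
import re


def match(pattern, target):
    """Match target against a '{name}' style pattern; return variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def trigger(provides, target, output=None):
    """Return (variables, step_output) if any provides pattern matches target."""
    for pattern in provides:
        variables = match(pattern, target)
        if variables is None:
            continue
        if output is None:
            step_output = [target]  # default: only the matched file
        else:
            # explicit output: expand every template with the matched variables
            step_output = [o.format(**variables) for o in output]
        return variables, step_output
    return None


def count_runs(provides, targets, output=None):
    """Count how often the step runs when the targets are requested in turn."""
    produced, runs = set(), 0
    for target in targets:
        if target in produced:
            continue  # an earlier run already generated this target
        _, step_output = trigger(provides, target, output)
        produced.update(step_output)
        runs += 1
    return runs


PROVIDES = ["{filename}.bak", "{filename}.tmp"]
OUTPUT = ["{filename}.bak", "{filename}.tmp"]

print(trigger(PROVIDES, "text.tmp"))
print(trigger(PROVIDES, "text.tmp", OUTPUT))
print(count_runs(PROVIDES, ["text.bak", "text.tmp"]))
print(count_runs(PROVIDES, ["text.bak", "text.tmp"], OUTPUT))
```

In this model the step runs twice without the output templates and once with them, because the first run already accounts for both files.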