
Makefile-style pattern-matching rules

  • Difficulty level: intermediate
  • Time needed to learn: 20 minutes or less
  • Key points:
    • The provides option extends the "data-flow" workflow style by allowing steps to generate different outputs.
    • A step with the provides section option has a default step_output, and its input can be derived from pattern-matched variables.

Auxiliary steps

Auxiliary steps are special steps that are executed to provide targets that are required by others.

For example, when the following step is executed with an input file bamfile (with extension .bam), it checks for the existence of the input file (bamfile) and of a dependent index file (with extension .bam.bai).

[100 (call variant)]
input:   bamfile
depends: bamfile + '.bai'
run:
    # commands to call variants from 
    # input bam file

Because the step depends on an index file, SoS will look in the script for a step that provides such a target, which would be similar to

[index_bam : provides='{sample}.bam.bai']
input: f"{sample}.bam"
run: expand=True
     samtools index {_input}

Such a step is characterized by a provides option (or is a step with a simple output statement) and is called an auxiliary step. In this particular case, if bamfile="AS123.bam", the requested dependent file would be AS123.bam.bai. Through the matching mechanism of the provides option, the index_bam step would be executed with variables sample="AS123" and step_output="AS123.bam.bai".

Unless option -T (see tracing dependency for details) is specified, SoS will not check whether a dependent target can be generated by an auxiliary step if the target already exists. In an extreme case, sos run -t filename will quit immediately if filename already exists.

An auxiliary step can trigger other auxiliary steps and form a DAG (Directed Acyclic Graph). In fact, you can write a workflow in makefile style using only auxiliary steps and execute workflows defined by targets. If you are familiar with Makefiles, and especially with snakemake, it may feel natural to implement your workflow in this style. The advantage of SoS is that you can use either or both forward-style and makefile-style steps to define your workflow and take advantage of both approaches.

Step option provides

An auxiliary step is defined by the provides option in the section header, in the format of

[step_name : provides=target]

where target can be

  • A filename or file pattern such as "{sample}.bam.idx"
  • Other types of targets such as executable("ms")
  • A list (sequence) of one or more file patterns and targets.

File Pattern

A file pattern is a filename that may contain variable names enclosed in { }. SoS matches filenames against the pattern and, if the match succeeds, assigns the matched parts of the name to those variables. For example,

[compress: provides = '{filename}.bam']

would be triggered by targets such as sample_A.bam and sample_B.bam. When the step is triggered by sample_A.bam, it defines variable filename as sample_A and sets the output of the step to sample_A.bam.
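The matching described above can be illustrated with a minimal Python sketch. It is not SoS's actual implementation: the helper name match_pattern is hypothetical, and the sketch assumes each {name} placeholder greedily matches one or more characters.

```python
import re

def match_pattern(pattern, target):
    """Return a dict of matched variables, or None if target does not match.

    Hypothetical helper for illustration only; SoS's own matching rules
    may differ in details.
    """
    # re.split with a capturing group alternates literal text (even indices)
    # and placeholder names (odd indices).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text such as ".bam"
        else:
            regex += f"(?P<{part}>.+)"        # named group for {part}
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

print(match_pattern("{filename}.bam", "sample_A.bam"))     # {'filename': 'sample_A'}
print(match_pattern("{sample}.bam.bai", "AS123.bam.bai"))  # {'sample': 'AS123'}
print(match_pattern("{filename}.bam", "sample_A.txt"))     # None
```

A successful match yields both the variables available to the step and, by default, the matched target as the step output.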

The following example removes all local *.bam and *.bam.bai files before it executes three workflows defined by targets. We use magic %run to execute it, which is equivalent to executing it from the command line using commands such as

sos run myscript -t TS1.bam

Let us create a workflow with two auxiliary steps compress and index. The compress step generates a .bam file (no input here for simplicity) and index step creates a .bam.bai file from the .bam file.

In [1]:
Cell content saved to test_provides.sos, use option -r to also execute the cell.

If we only want to generate a bam file (with option -t TS1.bam), the compress step is executed

In [2]:
> compress input to TS1.bam

If we would like to generate both .bam and .bam.bai files (with option -t TS2.bam.bai), both steps are executed.

In [3]:
> compress input to TS2.bam
> index TS2.bam to TS2.bam.bai

As you can see from the output, when the first workflow is executed with target TS1.bam, step compress is executed to produce it. In the second run, both steps are executed to generate TS2.bam and then TS2.bam.bai.
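The way a requested target pulls in a chain of auxiliary steps can be modelled with a short Python sketch. This is an illustration under stated assumptions, not how SoS builds its DAG: the RULES table, match_pattern, and plan are hypothetical names, and each rule is assumed to have at most one input.

```python
import re

# Hypothetical rule table mirroring the compress and index steps:
# step name -> (provides pattern, input template or None)
RULES = {
    "compress": ("{filename}.bam", None),
    "index": ("{filename}.bam.bai", "{filename}.bam"),
}

def match_pattern(pattern, target):
    """Match target against a {name}-style pattern; return variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

def plan(target, existing=frozenset()):
    """Return (step, output) pairs needed to build target, in execution order."""
    if target in existing:
        return []  # an existing target is not regenerated
    for name, (pattern, input_template) in RULES.items():
        variables = match_pattern(pattern, target)
        if variables is None:
            continue
        steps = []
        if input_template is not None:
            # The input of this step becomes a new target to resolve first.
            steps += plan(input_template.format(**variables), existing)
        steps.append((name, target))
        return steps
    raise ValueError(f"No step provides {target}")

print(plan("TS1.bam"))      # [('compress', 'TS1.bam')]
print(plan("TS2.bam.bai"))  # [('compress', 'TS2.bam'), ('index', 'TS2.bam.bai')]
```

Passing existing={"TS2.bam"} to plan drops the compress step, which corresponds to the behavior described earlier for targets that already exist.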

Non-file targets

In addition to output files, an auxiliary step can provide targets of other types. A widely used target type is sos_variable, which provides variables that can be accessed by later steps. For example,

In [4]:
There are 94 notebooks in this directory

However, for this particular example, it is more straightforward to return the variable with option shared as follows:

In [5]:
There are 94 notebooks in this directory

Multiple targets

You can specify multiple targets to the provides option. A step would be triggered if any of the targets matches.

For example, the temp step is triggered twice in the following example, first time by target text.bak and the second time by target text.tmp.

In [6]:
Touch text.tmp
Touch text.bak

However, depending on how the auxiliary step is designed, it might generate multiple outputs at the same time, and it would be wasteful to execute the step multiple times. In this case, you can define an output statement to let SoS know that a single execution of the step generates multiple targets.

In [7]:
Touch text.bak text.tmp

Technically speaking, the provides option generates a default step_output, which is the matched filename: a single file, text.bak or text.tmp, depending on which target SoS is trying to find a step for. With an explicit output statement, a match on either text.bak or text.tmp leads to filename='text' and a step_output containing both text.bak and text.tmp.
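The difference between the default step_output and an explicit output statement can be sketched in Python as follows. This is a simplified model, not SoS code: the function names are hypothetical, and the provides patterns '{filename}.bak' and '{filename}.tmp' are assumed from the description above.

```python
import re

def match_pattern(pattern, target):
    """Match target against a {name}-style pattern; return variables or None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

def resolve_output(provides, target, output_templates=None):
    """Return (variables, step_output) for a step triggered by target, or None."""
    for pattern in provides:
        variables = match_pattern(pattern, target)
        if variables is None:
            continue
        if output_templates is None:
            # Default: the step output is just the matched target.
            return variables, [target]
        # Explicit output statement: expand every template with the variables.
        return variables, [t.format(**variables) for t in output_templates]
    return None

provides = ["{filename}.bak", "{filename}.tmp"]  # assumed patterns

# Without an output statement, each target triggers its own execution.
print(resolve_output(provides, "text.bak"))
# ({'filename': 'text'}, ['text.bak'])

# With an output statement, either target yields both files in one execution.
print(resolve_output(provides, "text.tmp", ["{filename}.bak", "{filename}.tmp"]))
# ({'filename': 'text'}, ['text.bak', 'text.tmp'])
```

In this model, once one execution reports both files as its output, the second target is already satisfied and does not need to trigger the step again.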