Step input, output, and substeps

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • Input files are specified with the input statement, which defines variable _input
    • Output files are specified with the output statement, which defines variable _output
    • Input files can be processed in groups with the group_by option

Specifying step input and output

Recall the example workflow from our first tutorial, in which we defined variables such as excel_file and used them directly in the scripts.

In [1]:

You can add an input and an output statement to the steps and write the workflow as

In [2]:

Comparing the two workflows, you will notice that steps in the new workflow have input and output statements that define the input and output of the steps, and two magic variables _input and _output are used in the scripts. These two variables are of type sos_targets and are of vital importance to the use of SoS.

Substeps and input option group_by

The input and output statements tell SoS what the input and output of each step are, which allows SoS to handle them much more intelligently. One of the most useful applications is the definition of substeps, which allows SoS to process groups of input one by one, and/or the same groups of input with different sets of variables (option for_each, which will be discussed later).

Let us assume that we have two input files data/S20_R1.fastq and data/S20_R2.fastq and we would like to check the quality of them using a tool called fastqc. Using a plain Python approach and the sh action, the analysis can be performed by

In [3]:
Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq

Alternatively, you can use the input statement to define variable _input with two files, and use a (slightly more convenient but less Pythonic) indented script format:

In [4]:
Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq

There are two problems with this approach:

  • The action sh, either in function call format or indented script format, is less readable, especially if the script is long, and more importantly,
  • The input files are handled one by one although they are independent and can be processed in parallel

To address these problems, you can write the step as follows:

In [5]:
Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq

In this example, option group_by=1 divides the two input files into two groups, each with one input file. Two substeps are created from the groups. They execute the same step process (statements after the input statement) but with different values of variable _input. The sh action is written in the script format, which can be a lot more readable if the script is long. The substeps are executed in parallel, so the step could be completed a lot faster than the for loop version.
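The grouping described above can be pictured with a small plain-Python sketch. It is only an illustration of the idea, not how SoS implements substeps: the helpers group_by_size and run_substep are hypothetical names, a thread pool stands in for SoS's parallel execution, and the stand-in substep returns a string instead of calling fastqc.

```python
from concurrent.futures import ThreadPoolExecutor


def group_by_size(files, size):
    """Split a list of files into consecutive groups of `size` items.

    Hypothetical helper that mimics the effect of group_by=1 for size=1.
    """
    return [files[i:i + size] for i in range(0, len(files), size)]


def run_substep(_input):
    """Stand-in for the step process; a real step would run fastqc here."""
    return f"processed {' '.join(_input)}"


inputs = ['data/S20_R1.fastq', 'data/S20_R2.fastq']

# Each group plays the role of _input in one substep.
groups = group_by_size(inputs, 1)

# The substeps are independent, so they can run concurrently.
# pool.map returns results in the order of the groups.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_substep, groups))

for line in results:
    print(line)
```

Every group is passed to the same function, which mirrors how all substeps share one step process and differ only in the value of _input.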

Output of substeps

The input statement defines input of the entire step, and optionally input of each substep as variable _input. The output statement, however, defines the output of each substep.

In the following example, the two input files are divided into two groups, represented by _input for each substep. The output statement defines a variable _output for each substep.

In [6]:
Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq

The output statement of this example is

output: f'{_input:n}_fastqc.html'

which takes the name of _input with its extension removed and adds _fastqc.html. For example, if _input = 'data/S20_R1.fastq', the corresponding _output = 'data/S20_R1_fastqc.html'.

With this output statement, SoS will, among many other things, check if the output is properly generated after the completion of each substep, and return an output object with the _output of each substep.
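The check-and-collect behavior can be sketched in plain Python as follows. This is an assumption-laden illustration rather than SoS's actual mechanism: run_substeps and fake_fastqc are hypothetical names, the substeps run sequentially for simplicity, and a placeholder file is written instead of running fastqc.

```python
import os
import tempfile


def run_substeps(inputs, name_output, process):
    """Run `process` once per input and verify that each output exists."""
    step_output = []
    for _input in inputs:
        _output = name_output(_input)
        process(_input, _output)
        # Check the output after the substep completes.
        if not os.path.isfile(_output):
            raise RuntimeError(f'{_output} was not generated')
        step_output.append(_output)
    # The step output gathers the _output of every substep.
    return step_output


def fake_fastqc(_input, _output):
    """Stand-in for fastqc: write a placeholder report file."""
    with open(_output, 'w') as report:
        report.write(f'report for {os.path.basename(_input)}\n')


with tempfile.TemporaryDirectory() as workdir:
    inputs = [os.path.join(workdir, name)
              for name in ('S20_R1.fastq', 'S20_R2.fastq')]
    step_output = run_substeps(
        inputs,
        lambda path: os.path.splitext(path)[0] + '_fastqc.html',
        fake_fastqc,
    )
    report_names = [os.path.basename(path) for path in step_output]

print(report_names)
```

If a substep finishes without producing its expected file, the sketch raises an error instead of silently continuing, which is the kind of safeguard an output declaration makes possible.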