- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Input files are specified with the `input` statement, which defines variable `_input`
  - Output files are specified with the `output` statement, which defines variable `_output`
  - Input files can be processed in groups with the `group_by` option
In the example workflow from our first tutorial, we defined variables such as `excel_file` and used them directly in the scripts.
You can add an `input` and an `output` statement to the steps and write the workflow as follows:
Comparing the two workflows, you will notice that the steps in the new workflow have `input` and `output` statements that define the input and output of the steps, and that two magic variables, `_input` and `_output`, are used in the scripts. These two variables are of type `sos_targets` and are of vital importance to the use of SoS.
The `input` and `output` statements tell SoS what the input and output of each step are, which allows SoS to handle them much more intelligently. One of the most useful applications is the definition of substeps, which lets SoS process groups of input one by one, and/or process the same groups of input with different sets of variables (option `for_each`, which will be discussed later).
Let us assume that we have two input files, `data/S20_R1.fastq` and `data/S20_R2.fastq`, and that we would like to check their quality using a tool called `fastqc`. Using a plain Python approach and the `sh` action, the analysis can be performed with a `for` loop over the two files:
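As a rough illustration of the loop-based approach, here is a minimal plain-Python sketch. It uses `subprocess.run` in place of the SoS `sh` action, and the helper names `fastqc_commands` and `run_sequentially` are hypothetical. It also assumes that `fastqc` is on the `PATH` and that the data files exist when the commands are actually run.

```python
import subprocess


def fastqc_commands(files):
    """Build one fastqc command per input file (nothing is executed here)."""
    return [["fastqc", path] for path in files]


def run_sequentially(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = fastqc_commands(["data/S20_R1.fastq", "data/S20_R2.fastq"])
# run_sequentially(commands)  # uncomment where fastqc and the data files exist
```

Each file is processed only after the previous one has finished, which is the limitation the rest of this section addresses.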
Or you can use the `input` statement to define variable `_input` with two files, and use a (slightly more convenient but less Pythonic) indented script format:
There are two problems with this approach:
- The action `sh`, either in function call format or in indented script format, is less readable, especially if the script is long, and more importantly,
- The input files are handled one by one although they are independent and can be processed in parallel
To address these problems, you can write the step as follows:
Substeps created by the `group_by` input option
- The `group_by` option divides the input files into multiple groups
- One substep is created for each group of input files
- The input of each substep is stored in variable `_input`
- The substeps are by default executed in parallel
In this example, option `group_by=1` divides the two input files into two groups, each with one input file. Two substeps are created from the groups. They execute the same step process (the statements after the `input` statement) but with different values of variable `_input`. The `sh` action is written in the script format, which can be a lot more readable if the script is long. The substeps are executed in parallel, so the step could complete a lot faster than the `for` loop version.
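To make the grouping idea concrete, the following is a plain-Python analogy, not SoS's actual implementation. The functions `group_by` and `substep` are hypothetical stand-ins, and a thread pool is assumed as the way to run the substeps concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def group_by(files, size):
    """Split files into consecutive groups of `size` items."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def substep(_input):
    """Stand-in for the step process; here it only builds the command line."""
    return "fastqc " + " ".join(_input)


step_input = ["data/S20_R1.fastq", "data/S20_R2.fastq"]
groups = group_by(step_input, 1)

# One substep per group, run concurrently; map() keeps the group order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(substep, groups))
```

Every call to `substep` receives a different `_input`, mirroring how each substep sees its own group of input files.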
The `output` statement
- The `output` statement defines the output of each substep, represented by variable `_output`.
- The output of the entire step consists of the `_output` from each substep.
The `input` statement defines the input of the entire step, and optionally the input of each substep as variable `_input`. The `output` statement, however, defines the output of each substep.
In the following example, the two input files are divided into two groups, represented by `_input` for each substep. The `output` statement defines a variable `_output` for each substep.
Special format specification for `_input` objects
SoS variables `_input` and `_output` are of type `sos_targets` and accept additional format specifications. For example,
- `:n` is the name of the path (the path without its extension), e.g. `f'{_input:n}'` returns `/path/to/a` if `_input` is `/path/to/a.txt`
- `:b` is the basename of the path, e.g. `a.txt` from `/path/to/a.txt`
- `:d` is the directory name of the path, e.g. `/path/to` from `/path/to/a.txt`
The output statement of this example is
output: f'{_input:n}_fastqc.html'
which takes the name of `_input` and adds `_fastqc.html`. For example, if `_input = 'data/S20_R1.fastq'`, the corresponding `_output = 'data/S20_R1_fastqc.html'`.
With this output statement, SoS will, among many other things, check that the output is properly generated after the completion of each substep, and return an output object with the `_output` of each substep.
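The relationship between per-substep output and step output can be sketched in plain Python. The helpers `substep_output` and `missing_outputs` are hypothetical, and the existence check is only an assumed approximation of the kind of verification described above, not SoS's own logic.

```python
import os.path


def substep_output(_input):
    """Output name of one substep: the input without its extension plus a suffix."""
    return os.path.splitext(_input)[0] + "_fastqc.html"


def missing_outputs(outputs):
    """Return the expected output files that do not exist on disk."""
    return [path for path in outputs if not os.path.exists(path)]


step_input = ["data/S20_R1.fastq", "data/S20_R2.fastq"]

# The output of the whole step is the collection of each substep's output.
step_output = [substep_output(path) for path in step_input]
```

After the substeps have run, calling `missing_outputs(step_output)` would list any report that was not produced.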