- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
  - Input files are specified with the input statement, which defines variable _input
  - Output files are specified with the output statement, which defines variable _output
  - Input files can be processed in groups with the group_by option
Recall the example workflow from our first tutorial, in which we defined variables such as excel_file and used them directly in the scripts.
You can add an input and an output statement to the steps and write the workflow as
Comparing the two workflows, you will notice that steps in the new workflow have input and output statements that define the input and output of the steps, and two magic variables _input and _output are used in the scripts. These two variables are of type sos_targets and are of vital importance to the use of SoS.
The input and output statements tell SoS what the input and output of each step are, which allows SoS to handle them more intelligently. One of the most useful applications is the definition of substeps, which allows SoS to process groups of input one by one, and/or to process the same groups of input with different sets of variables (option for_each, which will be discussed later).
Let us assume that we have two input files data/S20_R1.fastq and data/S20_R2.fastq and we would like to check their quality using a tool called fastqc. Using a plain Python approach and the sh action, the analysis can be performed by
Alternatively, you can use the input statement to define variable _input with two files, and use a (slightly more convenient but less Pythonic) indented script format:
There are two problems with this approach:
- The action sh, either in function call format or indented script format, is less readable, especially if the script is long, and more importantly,
- The input files are handled one by one although they are independent and can be processed in parallel
To address these problems, you can write the step as follows:
Substeps created by the group_by input option
- The group_by option groups input files and creates multiple groups of input files
- Multiple substeps are created for each group of input files
- The input of each substep is stored in variable _input
- The substeps are by default executed in parallel
In this example, option group_by=1 divides the two input files into two groups, each with one input file. Two substeps are created from the groups. They execute the same step process (statements after the input statement) but with different values of variable _input. The sh action is written in the script format, which can be a lot more readable if the script is long. The substeps are executed in parallel, so the step can complete a lot faster than the for loop version.
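The grouping behaviour can be pictured with a few lines of plain Python. The sketch below is only an analogy and not how SoS is implemented: group_by_size and run_substep are hypothetical helpers, and each substep merely builds the fastqc command line instead of running the tool.

```python
from concurrent.futures import ThreadPoolExecutor


def group_by_size(files, size):
    """Hypothetical helper: split files into consecutive groups of `size`."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def run_substep(_input):
    """Stand-in for the step process; it only builds the command line."""
    return f"fastqc {' '.join(_input)}"


step_input = ['data/S20_R1.fastq', 'data/S20_R2.fastq']

# With a group size of 1, every file ends up in its own group.
groups = group_by_size(step_input, 1)

# Each group is handled by one substep; the substeps are independent,
# so they can be submitted to a pool and run concurrently.
with ThreadPoolExecutor() as pool:
    commands = list(pool.map(run_substep, groups))
```

Each call to run_substep sees a different _input, which mirrors how the same step process is executed once per group.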
The output statement
- The output statement defines the output of each substep, represented by variable _output.
- The output of the entire step consists of _output from each substep.
The input statement defines input of the entire step, and optionally input of each substep as variable _input. The output statement, however, defines the output of each substep.
In the following example, the two input files are divided into two groups, represented by _input for each substep. The output statement defines a variable _output for each substep.
Special format specification for _input objects
SoS variables _input and _output are of type sos_targets and accept additional format specifications. For example,
- :n is the name of the path. e.g. f'{_input:n}' returns /path/to/a if _input is /path/to/a.txt
- :b is the basename of the path. e.g. a.txt from /path/to/a.txt
- :d is the directory name of the path. e.g. /path/to from /path/to/a.txt
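To get a feel for what these three specifications return, the following plain Python sketch mimics them with pathlib. PathLike is a hypothetical class written for this illustration; it is not the sos_targets type and handles only a single path.

```python
from pathlib import PurePosixPath


class PathLike:
    """Hypothetical stand-in for a single target, for illustration only."""

    def __init__(self, path):
        self.path = PurePosixPath(path)

    def __format__(self, spec):
        if spec == 'n':
            # name: the path with its extension removed
            return str(self.path.with_suffix(''))
        if spec == 'b':
            # basename: the last component of the path
            return self.path.name
        if spec == 'd':
            # directory name: everything before the last component
            return str(self.path.parent)
        # Any other spec falls back to ordinary string formatting.
        return format(str(self.path), spec)


_input = PathLike('/path/to/a.txt')
name = f'{_input:n}'
base = f'{_input:b}'
directory = f'{_input:d}'
```

Here name, base, and directory hold /path/to/a, a.txt, and /path/to respectively, matching the examples in the list above.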
The output statement of this example is
output: f'{_input:n}_fastqc.html'
which takes the name of _input and adds _fastqc.html. For example, if _input = 'data/S20_R1.fastq', the corresponding _output = 'data/S20_R1_fastqc.html'.
With this output statement, SoS will, among many other things, check if the output is properly generated after the completion of each substep, and return an output object with the _output of each substep.
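The relationship between per-substep output and step output can be sketched in plain Python as well. This is a simplified analogy under stated assumptions, not SoS internals: expected_output, run_step, and fake_fastqc are hypothetical names, and fake_fastqc writes a placeholder file instead of running fastqc.

```python
import tempfile
from pathlib import Path


def expected_output(_input):
    """Drop the extension of _input and append _fastqc.html."""
    p = Path(_input)
    return p.with_name(p.stem + '_fastqc.html')


def run_step(inputs, process):
    """Run one substep per input and collect the verified outputs."""
    step_output = []
    for _input in inputs:
        _output = expected_output(_input)
        process(_input, _output)
        # Check that the substep really produced its declared output.
        if not _output.exists():
            raise RuntimeError(f'{_output} was not generated')
        step_output.append(_output)
    return step_output


def fake_fastqc(_input, _output):
    """Placeholder for the real tool: just create the report file."""
    _output.write_text('report')


with tempfile.TemporaryDirectory() as tmp:
    inputs = [Path(tmp) / 'S20_R1.fastq', Path(tmp) / 'S20_R2.fastq']
    step_output = run_step(inputs, fake_fastqc)
    output_names = [p.name for p in step_output]
```

The list returned by run_step plays the role of the step output: it contains one _output per substep, and a missing file is reported as soon as its substep completes.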