- Difficulty level: intermediate
- Time needed to learn: 30 minutes or less
- Key points:
  - The `input` statement accepts regular Python arguments to specify the input targets of a step
  - Input files can be grouped to create substeps, and can be labeled and accessed by their labels
  - A step can include part or all of the output from other steps
  - You can attach variables to individual input files or substeps
The `input` statement defines the input files or targets of a SoS step. It is optional, but it is fundamental to all but very simple workflows. See the "How to create dependencies between SoS steps" tutorial for a quick overview of the use of input statements. Here we list what you can put in the `input` statement of a step, with simple examples; refer to the other tutorials for more in-depth discussions of each topic.
The `input` statement is optional. When no input file is defined, a step will either have undefined input, or take the output of its previous step as its input.
For example, the following workflow has a step `A` that executes a simple shell script. No input statement is needed and the workflow works just fine.
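A minimal sketch of such a workflow is shown here. The step name `A` comes from the text, while the `echo` command is only an illustrative placeholder:

```sos
[A]
# No input statement: the step simply runs the shell script below.
sh:
    echo "I am step A"
```

Assuming the script is saved under a hypothetical name such as `workflow.sos`, it could be run with `sos run workflow.sos A`.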
In the special case where a workflow is defined with numerically indexed steps, a step without an input statement depends on its previous step and takes that step's output as its input. Here we present only a very simple example; you will see more complex examples in other tutorials.
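As a sketch of this behavior (the file name `a.txt` and the step numbers are illustrative assumptions), step `20` below has no `input` statement, so it should receive the output of step `10` as its `_input`:

```sos
[10]
output: 'a.txt'
# expand=True lets {_output} be interpolated into the shell script
sh: expand=True
    touch {_output}

[20]
# No input statement: _input is taken from the output of step 10
print(f'Step 20 received: {_input}')
```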
The easiest way to explicitly specify the input of a step is to list input files directly in the `input` statement. Because SoS checks the existence of input files when it executes a step, let us first create a few files:
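One way to create the files is a small step that calls `touch`. The file names `a.txt` through `d.txt` are assumptions chosen to match the examples later in this section:

```sos
# Create four empty files to serve as input for the examples that follow
sh:
    touch a.txt b.txt c.txt d.txt
```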
The following is a SoS step (with a default section head) with an `input` statement, which results in a `step_input` variable that contains a single file, `a.txt`:
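Such a step could look like the following sketch, which assumes that `a.txt` already exists in the working directory:

```sos
# A single input file listed directly in the input statement
input: 'a.txt'
print(step_input)
```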
Multiple files can be listed as multiple parameters, sequences (`list`, `tuple`, etc.), or variables of string or sequence types. For example, you can define a parameter `in_files` of type `paths` (a list of `path`) and specify input files from the command line:
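A sketch of this pattern follows. The parameter name `in_files` and the type `paths` come from the text; the rest is illustrative:

```sos
# in_files has no default value, so it must be supplied on the command line
parameter: in_files = paths
input: in_files
print(step_input)
```

Assuming the script is saved under a hypothetical name such as `inputs.sos`, the files might be passed as `sos run inputs.sos --in_files a.txt b.txt`.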
You can also list multiple files and mix string literals with variable names in the same statement.
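As an illustration, the sketch below mixes two string literals with a list held in a variable. The variable name `others` is hypothetical, and the files are assumed to exist:

```sos
# A variable of sequence type, defined before the input statement
others = ['c.txt', 'd.txt']
# String literals and the variable are combined into a single list of inputs
input: 'a.txt', 'b.txt', others
print(step_input)
```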
Because the steps in these examples do not have any substeps, it is equivalent to use variable `_input` instead of `step_input`.
A step can be executed multiple times with different variables; each of these executions is called a substep. The input of each substep is assigned to variable `_input`. The most common way to define substeps is to use option `group_by` to group input files. For example, `group_by=1` places each input file in its own substep.
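A sketch of grouping by a fixed size is shown here; it assumes the four files exist and uses a group size of 2 purely for illustration:

```sos
# Four input files in groups of two should yield two substeps
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_by=2
# The statements below run once per substep, each with its own _input
print(f'Substep input: {_input}')
```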
Another way to create substeps is to repeat the step with different values of a variable. In the following example, a variable `val` is defined to iterate through the list `[1, 2]`, which creates two substeps with `val=1` and `val=2` respectively.
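The text above does not name the option involved; the sketch below assumes it is the `for_each` option of the `input` statement:

```sos
# No input files are needed; the step is repeated once per value of val
input: for_each=dict(val=[1, 2])
print(f'Running substep with val={val}')
```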
You can assign labels to subsets of your input files and refer to these subsets by their labels.
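As a sketch, labels can be given as keyword arguments of the `input` statement. The label names `A` and `B` and the file names are illustrative assumptions:

```sos
# Keyword arguments that are not input options are treated as labels
input: A='a.txt', B=['b.txt', 'c.txt']
# Subsets of the input can then be retrieved by label
print(_input['A'])
print(_input['B'])
```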
Variables `step_input` and `_input` are of type `sos_targets`, which consists of SoS targets, most of which are of type `file_target`. Every target has a dictionary that can be used to store attributes related to it.
For example, by pairing a list of sample names with a list of input files, the attribute `sample_name` is attached to each input file and can be accessed through `.sample_name`.
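The text does not name the option that performs the pairing; the sketch below assumes it is `paired_with`, and the sample names `A` and `B` are illustrative:

```sos
# Each input file is paired with one entry of the sample_name list
input: 'a.txt', 'b.txt', paired_with=dict(sample_name=['A', 'B'])
# The attribute is read from an individual target
print(_input[0].sample_name)
print(_input[1].sample_name)
```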
The variables are attached to individual input files, so they stay with the files when the input is split into substeps:
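A sketch of this behavior, again assuming the `paired_with` option, combines the pairing with `group_by=1` so that each substep sees one file together with its attribute:

```sos
input: 'a.txt', 'b.txt', paired_with=dict(sample_name=['A', 'B']), group_by=1
# Each substep holds a single file, which carries its own sample_name
print(f'{_input}: {_input[0].sample_name}')
```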
When we group input files, we can attach attributes to the entire group represented by variable `_input`. This is done through option `group_with`.
For example, in the following workflow, four files are divided into two groups of two files each. The items of the list `['AB', 'CD']` are attached to the two groups under the name `sample_name`, and can be accessed as `_input.sample_name`.
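Such a workflow might be sketched as follows. The option names and the list `['AB', 'CD']` come from the text, while the file names are assumptions:

```sos
# Two groups of two files; each group receives one item of the list
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_by=2, group_with=dict(sample_name=['AB', 'CD'])
# The attribute belongs to the group, so it is read from _input itself
print(f'{_input}: {_input.sample_name}')
```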
For simplicity (and backward compatibility), substep variables can also be accessed directly in substeps, so you can use `sample_name` instead of `_input.sample_name`.
The `input` statement accepts Python functions. The function `named_output` includes named output from another step. It can refer to part of a step's output when that step has multiple named outputs, or to all of it. Similarly, the function `output_from` includes the complete output of a specified step: