The `input` statement

Difficulty level: intermediate
Time needed to learn: 30 minutes or less
Key points:
- The input statement accepts regular Python arguments to specify the input targets of a step
- Input files can be grouped to create substeps, and can be labeled and accessed by their labels
- A step can include part or all of the output of other steps
- You can attach variables to individual input files or substeps

The input statement defines the input files or targets of a SoS step. It is optional, but it is fundamental to all but the simplest workflows. See the tutorial How to create dependencies between SoS steps for a quick overview of how input statements are used. This tutorial lists what you can put in the input statement of a step, with simple examples; refer to the other tutorials for more in-depth discussion of each topic.

Steps with no input statement

The input statement is optional. When no input statement is given, the input of a step is either undefined or the output of its previous step.

For example, the following workflow has a step A that executes a simple shell script. No input statement is needed and the workflow works just fine.

do something

In the special case of a workflow defined with numerically indexed steps, a step without an input statement depends on its previous step and takes that step's output as its input. Here we present only a very simple example; you will see more complex examples in other tutorials.

[##] 2 steps processed (2 jobs completed)

Unnamed input files

The easiest way to explicitly specify input of a step is to list input files directly in the input statement. Because SoS checks the existence of input files when it executes a step, let us first create a few files:

The following is a SoS step (with a default section head) with an input statement, which results in a step_input variable with a single file a.txt:

step_input is a.txt

Multiple files can be listed as multiple parameters, sequences (list, tuple, etc.), or variables of string or sequence types. For example, you can define a parameter in_files of type paths (a list of path) and specify input files from the command line:

[#] 1 step processed (1 job completed)

You can list multiple files and mix string literals with variable names:

step_input is a.txt b.txt c.txt d.txt

Because the steps in these examples have no substeps, using variable _input is equivalent to using step_input.

Substep created by option `group_by`

A step can be executed multiple times with different variables; each such execution is called a substep. The input of each substep is assigned to variable _input. The most common way to define substeps is to use option group_by to group input files.

For example,

Input of substep is a.txt b.txt
Input of substep is c.txt d.txt
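The grouping behind this output can be pictured in plain Python. The sketch below is not SoS code and not how SoS implements group_by; group_by_size is a hypothetical helper, and it assumes the four files were grouped two at a time. How leftover files are handled when the count is not a multiple of the group size is a choice of this sketch only.

```python
from typing import List, Sequence


def group_by_size(files: Sequence[str], size: int) -> List[List[str]]:
    """Split files into consecutive groups of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [list(files[i:i + size]) for i in range(0, len(files), size)]


input_files = ["a.txt", "b.txt", "c.txt", "d.txt"]

# Each group plays the role of _input in one substep.
for _input in group_by_size(input_files, 2):
    print(f"Input of substep is {' '.join(_input)}")
```

Each group corresponds to one substep, and the loop variable stands in for _input.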

Substep created by option `for_each`

Another way to create substeps is to repeat the step with different values of a variable. In the following example, a variable val is defined to iterate through a list [1, 2], which creates two substeps with val=1 and val=2 respectively.

Processing a.txt with 1
Processing a.txt with 2
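As a rough mental model, repeating a step over the values of a variable resembles the plain Python loop sketched here. This is an illustration only, not SoS syntax or SoS internals; expand_for_each is a hypothetical helper, and the single input file a.txt is assumed from the output above.

```python
from typing import Any, Dict, List, Sequence, Tuple


def expand_for_each(
    files: Sequence[str], name: str, values: Sequence[Any]
) -> List[Tuple[List[str], Dict[str, Any]]]:
    """Return one (input files, variables) pair per value (hypothetical helper)."""
    return [(list(files), {name: value}) for value in values]


# Every substep sees the same input files but a different value of `val`.
for _input, variables in expand_for_each(["a.txt"], "val", [1, 2]):
    print(f"Processing {' '.join(_input)} with {variables['val']}")
```

The input files are the same in every substep; only the value of the variable changes.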

Named input

You can assign labels to subsets of your input files and refer to these subsets by their labels.

Step input is a.txt b.txt. Inputs with label A is a.txt. Input with label B is b.txt
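One way to think about labeled input is as a mapping from label to files, with the full step input being all files in order. The following plain Python sketch illustrates that idea under this assumption; it is not SoS code, and flatten_labeled is a hypothetical helper rather than part of SoS.

```python
from typing import Dict, List


def flatten_labeled(labeled: Dict[str, List[str]]) -> List[str]:
    """Concatenate the files of all labels, keeping insertion order (hypothetical helper)."""
    return [path for files in labeled.values() for path in files]


# Labels A and B each select a subset of the input files.
labeled_input = {"A": ["a.txt"], "B": ["b.txt"]}

print("Step input is", " ".join(flatten_labeled(labeled_input)))
for label, files in labeled_input.items():
    print(f"Input with label {label} is {' '.join(files)}")
```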

Attach variables to individual input files

Variables step_input and _input are of type sos_targets, which consists of SoS targets, most of which are file_targets. All targets have a dictionary that can be used to store attributes related to them.

For example, by pairing a list of sample names with a list of input files, the attribute sample_name is attached to each input file and can be accessed through .sample_name.

Input of substep is a.txt b.txt, with sample names A and B

The variables are attached to individual input files so they will be available with the files in substeps:

Input of substep is a.txt, with sample names A
Input of substep is b.txt, with sample names B
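The pairing can be pictured as zipping two lists of equal length and storing each value in the per-target attribute dictionary. The sketch below is plain Python under that assumption; Target and pair_with are hypothetical stand-ins, not SoS classes or functions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Target:
    """Hypothetical stand-in for a file target that carries an attribute dictionary."""
    path: str
    attrs: Dict[str, Any] = field(default_factory=dict)


def pair_with(paths: Sequence[str], name: str, values: Sequence[Any]) -> List[Target]:
    """Attach values[i] to paths[i] under the attribute `name` (hypothetical helper)."""
    if len(paths) != len(values):
        raise ValueError("paths and values must have the same length")
    return [Target(path, {name: value}) for path, value in zip(paths, values)]


targets = pair_with(["a.txt", "b.txt"], "sample_name", ["A", "B"])

# The attribute travels with its file, so it is still there when files are handled one by one.
for target in targets:
    print(f"Input of substep is {target.path}, with sample names {target.attrs['sample_name']}")
```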

Attach variables to substeps

When we group input files, we can attach attributes to the entire group represented by variable _input. This is done through option group_with.

For example, in the following workflow, 4 files are divided into two groups of two files each. The items of list ['AB', 'CD'] are attached to the two groups under the name sample_name and can be accessed as _input.sample_name.

Input of substep is a.txt b.txt, with sample name AB
Input of substep is c.txt d.txt, with sample name CD
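Attaching one value per group, rather than one per file, can be sketched in plain Python as follows. This only illustrates the idea and is not SoS code; attach_to_groups is a hypothetical helper, and the two groups are written out by hand to match the output above.

```python
from typing import Any, Dict, List, Sequence, Tuple


def attach_to_groups(
    groups: Sequence[Sequence[str]], name: str, values: Sequence[Any]
) -> List[Tuple[List[str], Dict[str, Any]]]:
    """Pair each group of files with one value stored under `name` (hypothetical helper)."""
    if len(groups) != len(values):
        raise ValueError("there must be exactly one value per group")
    return [(list(group), {name: value}) for group, value in zip(groups, values)]


groups = [["a.txt", "b.txt"], ["c.txt", "d.txt"]]

# Each (files, attributes) pair corresponds to one substep.
for _input, attrs in attach_to_groups(groups, "sample_name", ["AB", "CD"]):
    print(f"Input of substep is {' '.join(_input)}, with sample name {attrs['sample_name']}")
```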

For simplicity (and backward compatibility), the substep variables can be accessed directly in substeps so that you can use sample_name instead of _input.sample_name.

Input of substep is a.txt b.txt, with sample name AB
Input of substep is c.txt d.txt, with sample name CD

Named output with function `named_output`

The input statement accepts Python functions. A function named_output is defined to include named output from another step.

[##] 2 steps processed (2 jobs completed)

Output from another step using function `output_from`

named_output can be used to refer to part of the output of a step (if multiple named outputs exist) or to all of it. Similarly, a function output_from can be used to include the complete output of a specified step:

[##] 2 steps processed (2 jobs completed)
