- Difficulty level: intermediate
- Time needed to learn: 25 minutes or less
- Key points:
  - All SoS step variables such as _input, _output, step_input and step_output are of type sos_targets
  - sos_targets consists of labeled SoS targets, has optional groups, special format specification, and some member functions.
A target is an object that can be created and detected. A SoS step can take a list of targets as input, check the existence of a list of dependent targets, and produce a list of targets as output. Input, output and dependent targets for steps and substeps are exposed to you as special variables _input, _output, _depends, step_input, and step_output, all of type sos_targets.
sos_targets contains a list of targets (of type BaseTarget if you are curious), each of which can be a file_target that represents a file on the file system, a sos_variable that represents a defined variable, an R_Library that represents an R library, or some other type. Please refer to SoS targets for details about SoS targets.
In SoS, the input statement mostly creates a step_input object with provided parameters. That is to say,
input: 'a.txt', 'b.txt', group_by=1
is almost equivalent to
step_input = sos_targets('a.txt', 'b.txt', group_by=1)
and we can use sos_targets objects directly in an input statement in more complicated cases.
Variable _input represents the input targets for each substep (groups of sos_targets as we will see later).
When a step contains only one substep, step_input is the same as _input. For example, variables step_input and _input of the following step are sos_targets objects with a single file_target object:
and if you have multiple input files, you can pass them all together as a sos_targets with two file_target objects
or separately as two groups of inputs:
In this case, the step input contains two file_target objects:
step_input = sos_targets('SoS_Syntax.ipynb', 'SoS_Magics.ipynb')
but the step process is executed twice, with
_input = sos_targets('SoS_Syntax.ipynb')
_input = sos_targets('SoS_Magics.ipynb')
respectively. Because _input contains only one element, it is not necessary to use _input[0] in the script.
The sos_targets type keeps a list of BaseTarget objects. It can be initialized from one or more str (for file_target), or other targets. Lists of targets or dictionaries of targets (discussed later) will be flattened and concatenated so the end result will always be a one-dimensional list.
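The flattening rule can be illustrated in plain Python without SoS. The following is a minimal sketch, not the SoS implementation: flatten_targets is a hypothetical helper, targets are plain strings, and dictionary keys are simply dropped (SoS attaches additional meaning to them, which this sketch ignores).

```python
def flatten_targets(*args):
    """Flatten strings, lists/tuples, and dict values into one flat list."""
    flat = []
    for arg in args:
        if isinstance(arg, str):
            flat.append(arg)
        elif isinstance(arg, dict):
            # Assumption: only the values are kept; the keys are ignored here.
            flat.extend(flatten_targets(*arg.values()))
        elif isinstance(arg, (list, tuple)):
            flat.extend(flatten_targets(*arg))
        else:
            raise TypeError(f"unsupported target type: {type(arg).__name__}")
    return flat


targets = flatten_targets("a.txt", ["b.txt", "c.txt"], {"ref": ["d.txt"]})
print(targets)       # ['a.txt', 'b.txt', 'c.txt', 'd.txt']
print(targets[1:3])  # ['b.txt', 'c.txt']
```

However deeply the arguments are nested, the result is a one-dimensional list that can be indexed, sliced, and iterated.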
A sos_targets object behaves like a sequence that can be sliced and iterated. For example, the following statement creates a sos_targets object with three filenames from a single filename and a list of two filenames:
You can access one or more elements of a sos_targets or iterate through it
To convert a paths object to a regular list, you can use function list
or slice part of the paths using slices
Under the hood paths are represented as type path (derived from pathlib.Path) and file targets are represented as type file_target, which is derived from path. Paths that start with ~ and # will be expanded automatically, where
- Paths that start with ~ will be expanded with os.path.expanduser.
- Paths that start with #name will be expanded according to the host on which the workflow is executed. The name should be defined in the host definition under the keys paths or shared.
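The two expansion rules can be sketched in plain Python. This is only an illustration under stated assumptions: expand_path is a hypothetical helper, and the dictionaries local_paths and remote_paths stand in for the named paths of two host definitions; SoS reads these from its own configuration.

```python
import os.path
import posixpath


def expand_path(path, host_paths):
    """Expand a leading ~ or #name prefix in a path string."""
    if path.startswith("~"):
        return os.path.expanduser(path)
    if path.startswith("#"):
        # Split '#name/rest' into 'name' and 'rest'.
        name, _, rest = path[1:].partition("/")
        if name not in host_paths:
            raise KeyError(f"path name '{name}' is not defined for this host")
        base = host_paths[name]
        return posixpath.join(base, rest) if rest else base
    return path


# Hypothetical host definitions: the same name points to different locations.
local_paths = {"home": "/Users/me"}
remote_paths = {"home": "/home/me"}

print(expand_path("#home/data/a.txt", local_paths))   # /Users/me/data/a.txt
print(expand_path("#home/data/a.txt", remote_paths))  # /home/me/data/a.txt
```

The same #home path therefore resolves to different locations depending on which host definition is in effect.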
Now, if the same workflow is executed on docker, a remote host with a different #home, the output is different.
sos_targets accepts a list of format options that render paths in different ways. Here is a summary of format options with their effects:
| convertor | effect | operation | operand | output |
|---|---|---|---|---|
| a | absolute path | abspath() | test.sos | /path/to/test.sos |
| b | base filename | basename() | {home}/SoS/test.sos | test.sos |
| e | escape | replace(' ', '\\ ') | file 1.txt | file\ 1.txt |
| d | directory name | dirname() or '.' | /path/to/test.sos | /path/to |
| l | expand link | realpath() | test.sos | /realpath/to/test.sos |
| n | remove extension | splitext()[0] | /path/to/test.sos | /path/to/test |
| p | posix name | replace('\\', '/')... | c:\\Users | /c/Users |
| q | quote | quoted() | file 1.txt | 'file 1.txt' |
| r | repr | repr() | file.txt | 'file.txt' |
| s | str | str() | file.txt | file.txt |
| U | undo expanduser | replace(expanduser('~'), '~') | /home/user/test.sos | ~/test.sos |
| x | file extension | splitext()[1] | ~/SoS/test.sos | .sos |
| , | join with comma | ','.join() | ['a.txt', 'b.txt'] | a.txt,b.txt |
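Several of these convertors correspond directly to functions in the Python standard library. The following sketch reproduces a subset of them on plain strings. It is not the SoS implementation: the CONVERTORS table and the convert helper are hypothetical, shlex.quote stands in for quoted(), and applying several letters from left to right is an assumption of this sketch.

```python
import os.path
import shlex

# Each convertor letter maps to a plain function on a path string.
CONVERTORS = {
    "a": os.path.abspath,
    "b": os.path.basename,
    "d": lambda p: os.path.dirname(p) or ".",
    "e": lambda p: p.replace(" ", "\\ "),
    "n": lambda p: os.path.splitext(p)[0],
    "q": shlex.quote,
    "x": lambda p: os.path.splitext(p)[1],
    "U": lambda p: p.replace(os.path.expanduser("~"), "~"),
}


def convert(path, spec):
    """Apply the convertors named in spec to path, left to right."""
    for letter in spec:
        path = CONVERTORS[letter](path)
    return path


print(convert("/path/to/test.sos", "b"))   # test.sos
print(convert("/path/to/test.sos", "bn"))  # test
print(convert("/path/to/test.sos", "x"))   # .sos
print(convert("test.sos", "d"))            # .
print(convert("file 1.txt", "e"))          # file\ 1.txt
print(convert("file 1.txt", "q"))          # 'file 1.txt'
```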
These format options allow you to pass filenames to scripts in different formats. For example, it would be perfectly OK to pass ~/a.txt to a shell script, but a formatter should be added to expand or normalize the path if you are passing the filename to a script that does not understand ~ in filenames. For example,
An important difference between the formatting of sos_targets and regular lists of BaseTarget is that formatting is applied to each item and the results are joined by spaces or commas. For example, whereas a regular list is formatted as a list
a sos_targets is formatted as
or separated by , with format option ","
or after formatting each element with the specified formatter
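The difference can be reproduced in plain Python with a list subclass that overrides __format__. This is a minimal sketch under stated assumptions: FormattedTargets is a hypothetical class, items are plain strings, and only the b and , options are handled.

```python
import os.path


class FormattedTargets(list):
    """A list of path strings that formats item by item (hypothetical class)."""

    def __format__(self, spec):
        # A ',' in the spec selects comma as separator; the default is a space.
        sep = "," if "," in spec else " "
        item_spec = spec.replace(",", "")
        items = list(self)
        if "b" in item_spec:
            # 'b' keeps only the base filename of each item.
            items = [os.path.basename(p) for p in items]
        return sep.join(items)


files = FormattedTargets(["/data/a.txt", "/data/b.txt"])

print(f"{list(files)}")  # ['/data/a.txt', '/data/b.txt']
print(f"{files}")        # /data/a.txt /data/b.txt
print(f"{files:,}")      # /data/a.txt,/data/b.txt
print(f"{files:b}")      # a.txt b.txt
print(f"{files:b,}")     # a.txt,b.txt
```

Because each item is formatted separately and then joined, the result can be pasted directly into a command line or a comma-separated argument.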
One particular consequence of this format rule is that a sos_targets with only one element will be formatted exactly like a single target so you can use _input (a sos_targets) in place of _input[0] (a file_target) if you know there is only one target inside _input:
As a matter of fact, if a sos_targets has only one element, it will pass unrecognized attributes and functions to this element, so that
Basically, you can use _input exactly as _input[0] if there is only one file_target in _input.
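The forwarding of unrecognized attributes can be sketched with __getattr__ in plain Python. SingleOrMany is a hypothetical class used only for illustration, and pathlib.PurePosixPath stands in for file_target.

```python
from pathlib import PurePosixPath


class SingleOrMany(list):
    """A list that forwards unknown attributes to its only element."""

    def __getattr__(self, name):
        # __getattr__ is called only when normal lookup on the list fails.
        if len(self) == 1:
            return getattr(self[0], name)
        raise AttributeError(
            f"cannot get attribute '{name}' from a list of {len(self)} elements"
        )


one = SingleOrMany([PurePosixPath("/data/a.txt")])
print(one.suffix)               # .txt
print(one.with_suffix(".bam"))  # /data/a.bam
```

With more than one element the forwarding is ambiguous, so this sketch raises AttributeError in that case.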
Targets in sos_targets can be associated with arbitrary attributes. These attributes are usually assigned with option paired_with of an input statement.
Option paired_with accepts a dictionary and assigns attributes to each of the targets with specified values. For example,
Although targets and their attributes are usually set in an input statement, you can create targets and set attributes directly. For example
Here the target.set(name, value) function sets an attribute of the target, and target.get(name, default=None) gets the value of attribute name, returning default if name is not a valid attribute. It is therefore a safer way to retrieve an attribute than target.name if you are uncertain whether attribute name exists for target.
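The difference between attribute access and get can be seen in a small stand-alone model. This is a sketch, not the SoS classes: Target is a hypothetical class whose set and get methods mirror the signatures described above.

```python
class Target:
    """A target that carries arbitrary attributes (hypothetical class)."""

    def __init__(self, path):
        self.path = path
        self._attrs = {}

    def set(self, name, value):
        self._attrs[name] = value

    def get(self, name, default=None):
        return self._attrs.get(name, default)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        attrs = self.__dict__.get("_attrs", {})
        if name in attrs:
            return attrs[name]
        raise AttributeError(name)


bam = Target("sample1.bam")
bam.set("sample", "S1")

print(bam.sample)               # S1
print(bam.get("sample"))        # S1
print(bam.get("batch"))         # None
print(bam.get("batch", "n/a"))  # n/a
```

Accessing bam.batch directly would raise AttributeError, which is why get with a default is the safer choice for optional attributes.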
Targets in a sos_targets have an attribute label, which corresponds to the step in which the target is specified (input) or generated (output). For example, the label of a sos_targets that is directly specified in a step is the name of the step.
If you have multiple inputs, you can separate them into different groups using keyword arguments
If the input target is inherited from another step, the source will be the name of that step.
In a more complex case when the source comes from multiple input steps and the present step, the labels attribute points out the source of each target:
Although the use of keyword arguments will override the default source
The source information can be used to select subsets of targets according to their labels. For example, _input['prev'] would generate a sos_targets with all targets from source prev.
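Selection by label can be modeled in a few lines of plain Python. LabeledTargets is a hypothetical class for illustration only; it stores one label per target and returns a new object containing the matching targets when indexed with a string.

```python
class LabeledTargets:
    """Targets paired with labels, selectable by label (hypothetical class)."""

    def __init__(self, pairs):
        # pairs: iterable of (target, label) tuples
        pairs = list(pairs)
        self.targets = [target for target, _ in pairs]
        self.labels = [label for _, label in pairs]

    def __getitem__(self, key):
        if isinstance(key, str):
            # A string key selects all targets that carry this label.
            return LabeledTargets(
                (t, l) for t, l in zip(self.targets, self.labels) if l == key
            )
        return self.targets[key]


inputs = LabeledTargets(
    [("a.txt", "prev"), ("b.txt", "prev"), ("c.txt", "current")]
)
print(inputs.labels)           # ['prev', 'prev', 'current']
print(inputs["prev"].targets)  # ['a.txt', 'b.txt']
print(inputs[2])               # c.txt
```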
As we have seen, targets in a sos_targets can be grouped in many ways and _input contains subsets of the targets and is the input for each substep. For example, in the following example, the 4 input files are grouped into two groups of the same size. The step is executed twice, each time for a different group. step_input.groups contains a list of sos_targets that becomes _input of the substep.
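The relation between the step input and its groups can be sketched in plain Python. The helper group_by_size is hypothetical and only mimics grouping by a fixed size; the file names are made up for the example.

```python
def group_by_size(targets, size):
    """Split targets into consecutive groups of at most `size` items."""
    return [targets[i:i + size] for i in range(0, len(targets), size)]


step_input = ["a1.txt", "a2.txt", "b1.txt", "b2.txt"]
groups = group_by_size(step_input, 2)

# One substep per group; each group plays the role of _input.
for index, group in enumerate(groups):
    print(index, group)
# 0 ['a1.txt', 'a2.txt']
# 1 ['b1.txt', 'b2.txt']
```

The full list stays available as the step-level input, while each substep sees only its own group.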
You usually do not need to access groups of sos_targets directly but knowing the existence of groups would help you understand how groups are passed from one step to another.
For example, in the following workflow, when step 10 obtains output_from step A, it obtains a step_output with 4 groups, which then becomes the _input of each substep of step 10.
sos_targets provides the zap() function, which zaps all file targets in the list. This technique is usually used to remove large intermediate files during the execution of the workflow. For example, if you have a workflow that downloads and processes large files, you can do something like
[download: provides='{file}.fastq']
download: expand=True
    http://some_url/{file}.fastq

[default]
input: [f'{x}.fastq' for x in range(1000)], group_by=1
output: _input.with_suffix('.bam')
sh: expand=True
    process _input to _output

_input.zap()
In this example, 1000 fastq files are downloaded and processed, but the input files are zapped after they are processed. Although the files have been removed, re-running the workflow will not download and process the files again because the downloaded files are still considered to exist by SoS.
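One way to picture how a removed file can still be considered to exist is a marker-file scheme, sketched below in plain Python. This is an illustration under stated assumptions, not the SoS implementation: the functions zap and considered_existing are hypothetical, and the .zapped marker name and its contents are choices made for this sketch.

```python
import hashlib
import os
import tempfile


def zap(path):
    """Replace a file with a small marker recording its size and checksum."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    size = os.path.getsize(path)
    with open(path + ".zapped", "w") as marker:
        marker.write(f"{size}\t{digest}\n")
    os.remove(path)


def considered_existing(path):
    """A target counts as existing if the file or its marker is present."""
    return os.path.isfile(path) or os.path.isfile(path + ".zapped")


with tempfile.TemporaryDirectory() as tmp:
    fastq = os.path.join(tmp, "0.fastq")
    with open(fastq, "w") as f:
        f.write("ACGT\n")
    zap(fastq)
    file_present = os.path.isfile(fastq)
    still_exists = considered_existing(fastq)

print(file_present)  # False
print(still_exists)  # True
```

The large file is gone from disk, but the small marker is enough for an existence check to succeed, so a step that provides the file would not need to run again.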