- Difficulty level: intermediate
- Time needed to learn: 25 minutes or less
- Key points:
  - All SoS step variables such as `_input`, `_output`, `step_input`, and `step_output` are of type `sos_targets`.
  - `sos_targets` consists of labeled SoS targets and has optional groups, a special format specification, and some member functions.
A target is an object that can be created and detected. A SoS step can take a list of targets as input, check the existence of a list of dependent targets, and produce a list of targets as output. Input, output, and dependent targets for steps and substeps are exposed to you as special variables `_input`, `_output`, `_depends`, `step_input`, and `step_output`, all of which are of type `sos_targets`.
`sos_targets` contains a list of targets (of type `BaseTarget`, if you are curious). A target can be a `file_target` that represents a file on the file system, a `sos_variable` that represents a defined variable, an `R_Library` that represents an R library, or some other type. Please refer to SoS targets for details about SoS targets.
In SoS, the `input` statement mostly creates a `step_input` object with provided parameters. That is to say,
input: 'a.txt', 'b.txt', group_by=1
is almost equivalent to
step_input = sos_targets('a.txt', 'b.txt', group_by=1)
and we can use `sos_targets` objects directly in an `input` statement in more complicated cases.
Variable `_input` represents the input targets for each substep (`groups` of `sos_targets`, as we will see later).
In cases where a step contains only one substep, `step_input` is the same as `_input`. For example, variables `step_input` and `_input` of the following step are `sos_targets` objects with a single `file_target` object:
and if you have multiple input files, you can pass them all together as a `sos_targets` with two `file_target` objects, or separately as two groups of inputs:
In this case, the step input contains two `file_target` objects:
step_input = sos_targets('SoS_Syntax.ipynb', 'SoS_Magics.ipynb')
but the step process is executed twice, with
_input = sos_targets('SoS_Syntax.ipynb')
_input = sos_targets('SoS_Magics.ipynb')
respectively. Because `_input` contains only one element, it is not necessary to use `_input[0]` in the script.
The `sos_targets` type keeps a list of `BaseTarget` objects. It can be initialized from one or more `str` (for `file_target`) or other targets. Lists of targets or dictionaries of targets (discussed later) will be flattened and concatenated, so the end result will always be a one-dimensional list.
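The flattening rule can be sketched in plain Python. This is only an illustration of the idea, not the SoS implementation: `flatten_targets` is a hypothetical helper, and it handles only strings and nested lists or tuples (dictionaries are left out).

```python
def flatten_targets(*args):
    """Flatten strings and (nested) lists/tuples into one flat list.

    Hypothetical helper for illustration; not part of SoS.
    """
    flat = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            # Recurse into nested sequences and concatenate the result.
            flat.extend(flatten_targets(*arg))
        else:
            flat.append(arg)
    return flat


# One filename plus a list of two filenames gives three items.
targets = flatten_targets("a.txt", ["b.txt", "c.txt"])
print(targets)
```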
The variables appear to be a sequence that can be sliced and iterated. For example, the following statement creates a `sos_targets` object with three filenames from a single filename and a list of two filenames:
You can access one or more elements of a `sos_targets`
or iterate through it
To convert a `paths` object to a regular list, you can use the function `list`
or slice part of the `paths` using slices
Under the hood, paths are represented as type `path` (derived from `pathlib.Path`) and file targets are represented as type `file_target`, which is derived from `path`. Paths that start with `~` and `#` will be expanded automatically, where
- Paths that start with `~` will be expanded with `os.path.expanduser`.
- Paths that start with `#name` will be expanded according to the host on which the workflow is executed. The `name` should be defined in the host definition under the keys `paths` or `shared`.
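The two expansion rules can be sketched as follows. Only `os.path.expanduser` comes from the text above; the `expand_path` helper, the `HOST_PATHS` dictionary, and the directory values in it are assumptions made for illustration and do not reflect how SoS stores host definitions.

```python
import os

# Hypothetical named paths for one host. In SoS the real values come from
# the host definition (keys `paths` or `shared`); this dict is a stand-in.
HOST_PATHS = {"home": "/home/remote_user", "data": "/mnt/data"}


def expand_path(path, host_paths):
    """Expand a leading `~` or `#name` in a path string (illustrative only)."""
    if path.startswith("~"):
        return os.path.expanduser(path)
    if path.startswith("#"):
        # Split "#name/rest" into the name and the remainder of the path.
        name, sep, rest = path[1:].partition("/")
        if name not in host_paths:
            raise ValueError(f"undefined named path: #{name}")
        return host_paths[name] + sep + rest
    return path


print(expand_path("#home/project/a.txt", HOST_PATHS))
```

Running the same call with a different `HOST_PATHS` yields a different result, which is the behavior described for workflows executed on different hosts.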
Now, if the same workflow is executed on `docker`, a remote host with a different `#home`, the output is different.
`sos_targets` accepts a list of format options that make it easy to format paths in different ways. Here is a summary of the format options and their effects:
converter | operation | effect | operand | output |
---|---|---|---|---|
a | absolute path | abspath() | test.sos | /path/to/test.sos |
b | base filename | basename() | {home}/SoS/test.sos | test.sos |
e | escape | replace(' ', '\\ ') | file 1.txt | file\ 1.txt |
d | directory name | dirname() or '.' | /path/to/test.sos | /path/to |
l | expand link | realpath() | test.sos | /realpath/to/test.sos |
n | remove extension | splitext()[0] | /path/to/test.sos | /path/to/test |
p | posix name | replace('\\', '/')... | c:\\Users | /c/Users |
q | quote | quoted() | file 1.txt | 'file 1.txt' |
r | repr | repr() | file.txt | 'file.txt' |
s | str | str() | file.txt | file.txt |
U | undo expanduser | replace(expanduser('~'), '~') | /home/user/test.sos | ~/test.sos |
x | file extension | splitext()[1] | ~/SoS/test.sos | .sos |
, | join with comma | ','.join() | ['a.txt', 'b.txt'] | a.txt,b.txt |
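A few of the rows above can be reproduced with the standard library. The sketch below is not SoS code: `CONVERTERS` and `apply_format` are hypothetical names, only five of the options are covered, and applying the options from left to right is an assumption of this example.

```python
import os

# Hypothetical mapping that mirrors the "effect" column for a few options.
CONVERTERS = {
    "b": os.path.basename,                   # base filename
    "d": lambda p: os.path.dirname(p) or ".",  # directory name
    "n": lambda p: os.path.splitext(p)[0],   # remove extension
    "x": lambda p: os.path.splitext(p)[1],   # file extension
    "e": lambda p: p.replace(" ", "\\ "),    # escape spaces
}


def apply_format(path, spec):
    """Apply each single-letter option in `spec` to `path`, left to right."""
    for option in spec:
        path = CONVERTERS[option](path)
    return path


print(apply_format("/path/to/test.sos", "b"))
print(apply_format("/path/to/test.sos", "bn"))
```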
These format options allow you to pass filenames to scripts in different formats. For example, it would be perfectly OK to pass `~/a.txt` to a shell script, but a `u` formatter should be added if you are passing the filename to a script that does not understand `~` in filenames. For example,
An important difference between the formatting of `sos_targets` and regular lists of `BaseTarget` is that formatting is applied to each item and the results are joined by a space or a comma. For example, whereas a regular list is formatted as a list
a `sos_targets` is formatted as
or separated by `,` with format option `,`
or after formatting each element with the specified formatter
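The difference can be imitated with Python's `__format__` hook. `TargetList` below is a hypothetical stand-in that models only this formatting rule (space-joined by default, comma-joined with `,`, and `b` for the base filename); it is not how `sos_targets` is implemented.

```python
import os


class TargetList(list):
    """Hypothetical stand-in for sos_targets; only formatting is modeled."""

    def __format__(self, spec):
        items = [str(item) for item in self]
        if "b" in spec:
            # Format each element first (here: base filename only).
            items = [os.path.basename(item) for item in items]
        separator = "," if "," in spec else " "
        return separator.join(items)


files = TargetList(["/data/a.txt", "/data/b.txt"])

as_list = f"{list(files)}"   # a regular list keeps brackets and quotes
spaced = f"{files}"          # items joined by a space
comma = f"{files:,}"         # items joined by a comma
base = f"{files:b}"          # each item formatted, then joined

print(as_list)
print(spaced)
print(comma)
print(base)
```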
One particular consequence of this format rule is that a `sos_targets` with only one element will be formatted exactly like a single target, so you can use `_input` (a `sos_targets`) in place of `_input[0]` (a `file_target`) if you know there is only one target inside `_input`:
As a matter of fact, if a `sos_targets` has only one element, it will pass unrecognized attributes and functions to this element, so that
Basically, you can use `_input` exactly as `_input[0]` if there is only one `file_target` in `_input`.
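This kind of delegation is commonly written with `__getattr__`, which Python calls only when normal attribute lookup fails. The `Targets` class below is a hypothetical illustration built on `pathlib.PurePosixPath`; it is a sketch of the pattern, not the SoS source.

```python
from pathlib import PurePosixPath


class Targets:
    """Hypothetical container that forwards unknown attributes to its
    only element, mimicking the behavior described above."""

    def __init__(self, *items):
        self._items = [PurePosixPath(item) for item in items]

    def __getitem__(self, index):
        return self._items[index]

    def __getattr__(self, name):
        # Reached only when `name` is not found on the container itself.
        items = self.__dict__.get("_items", [])
        if len(items) == 1:
            return getattr(items[0], name)
        raise AttributeError(
            f"cannot forward {name!r}: container has {len(items)} items")


single = Targets("data/a.txt")
print(single.suffix)                 # same as single[0].suffix
print(single.with_suffix(".bam"))    # same as single[0].with_suffix(".bam")
```

With two or more elements the lookup raises `AttributeError`, since there is no single element to forward to.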
Targets in `sos_targets` can be associated with arbitrary attributes. These attributes are usually assigned with the option `paired_with` of an `input` statement.
Option `paired_with` accepts a dictionary and assigns attributes to each of the targets with specified values. For example,
Although targets and their attributes are usually set in an `input` statement, you can create targets and set attributes directly. For example,
Here the `target.set(name, value)` function sets an attribute of the `target`, and `target.get(name, default=None)` gets the value of attribute `name`, returning `default` if `name` is not a valid attribute. It is therefore a safer way to retrieve an attribute than `target.name` if you are uncertain whether attribute `name` exists for `target`.
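The `set`/`get` contract described above behaves much like a dictionary lookup with a default. The `Target` class below is a hypothetical stand-in used only to show why `get` is the safer accessor; the attribute name `sample` is made up for the example.

```python
class Target:
    """Hypothetical stand-in for a SoS target; only attributes are modeled."""

    def __init__(self, filename):
        self.filename = filename
        self._attrs = {}

    def set(self, name, value):
        # Attach an arbitrary attribute to this target.
        self._attrs[name] = value

    def get(self, name, default=None):
        # Return the attribute, or `default` when it was never set.
        return self._attrs.get(name, default)


target = Target("a.txt")
target.set("sample", "S1")

print(target.get("sample"))               # attribute that exists
print(target.get("batch", "unknown"))     # missing attribute, default returned
```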
Targets in a `sos_targets` have an attribute `label`, which corresponds to the step in which the target is specified (input) or generated (output). For example, the `label` of a `sos_targets` that is directly specified in a step is the name of the step.
If you have multiple inputs, you can separate them into different groups using keyword arguments
If the input target is inherited from another step, the source will be the name of that step.
In a more complex case, when the source comes from multiple input steps and the present step, the `labels` attribute points out the source of each target:
although the use of a keyword argument will override the default source
The source information can be used to select subsets of targets according to their labels. For example, `_input['prev']` would generate a `sos_targets` with all targets from source `prev`.
As we have seen, targets in a `sos_targets` can be grouped in many ways, and `_input` contains a subset of the targets and is the input for each substep. For example, in the following example, the 4 input files are grouped into two groups of the same size. The step is executed twice, each time for a different group. `step_input.groups` contains a list of `sos_targets`, each of which becomes the `_input` of a substep.
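The relationship between the step input, its groups, and the per-substep `_input` can be sketched with plain lists. `group_by_size` is a hypothetical helper that splits a list into consecutive chunks; the filenames are placeholders, and real `sos_targets` groups carry more information than this.

```python
def group_by_size(targets, size):
    """Split a list into consecutive groups of `size` items (illustrative)."""
    return [targets[i:i + size] for i in range(0, len(targets), size)]


step_input = ["a.txt", "b.txt", "c.txt", "d.txt"]
groups = group_by_size(step_input, 2)

# Each group plays the role of `_input` for one substep.
for index, _input in enumerate(groups):
    print(f"substep {index}: {' '.join(_input)}")
```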
You usually do not need to access the `groups` of a `sos_targets` directly, but knowing that `groups` exist will help you understand how groups are passed from one step to another.
For example, in the following workflow, when step `10` obtains `output_from` step `A`, it obtains a `step_output` with 4 groups, each of which then becomes the `_input` of a substep of step `10`.
`sos_targets` provides a `zap()` function that zaps all file targets in the list. This technique is usually used to remove large intermediate files during the execution of the workflow. For example, if you have a workflow that downloads and processes large files, you can do something like
[download: provides='{file}.fastq']
download: expand=True
    http://some_url/{file}.fastq

[default]
input: [f'{x}.fastq' for x in range(1000)], group_by=1
output: _input.with_suffix('.bam')
sh: expand=True
    process {_input} to {_output}

_input.zap()
In this example, 1000 `fastq` files are downloaded and processed, but the input files are zapped after they are processed. Although the files have been removed, re-running the workflow will not download and process the files again because SoS still considers the downloaded files to exist.