The sos_targets data type

  • Difficulty level: intermediate
  • Time needed to learn: 25 minutes or less
  • Key points:
    • All SoS step variables such as _input, _output, step_input and step_output are of type sos_targets
    • A sos_targets object consists of labeled SoS targets, with optional groups, a special format specification, and some member functions.

SoS targets

A target is an object that can be created and detected. A SoS step can take a list of targets as input, check the existence of a list of dependent targets, and produce a list of targets as output. Input, output, and dependent targets for steps and substeps are exposed to you as the special variables _input, _output, _depends, step_input, and step_output, all of which are of type sos_targets.

sos_targets contains a list of targets (of type BaseTarget, if you are curious), each of which can be a file_target that represents a file on the file system, a sos_variable that represents a defined variable, an R_Library that represents an R library, or some other type. Please refer to SoS targets for details about SoS targets.

sos_targets data type

Construction of sos_targets

In SoS, the input statement mostly creates a step_input object with provided parameters. That is to say,

input: 'a.txt', 'b.txt', group_by=1

is almost equivalent to

step_input = sos_targets('a.txt', 'b.txt', group_by=1)

and we can use sos_targets objects directly in an input statement in more complicated cases.

Variable _input represents the input targets for each substep (groups of sos_targets as we will see later).

When a step contains only one substep, step_input is the same as _input. For example, variables step_input and _input of the following step are sos_targets objects with a single file_target object:

In [1]:
step_input='sos_datatypes.ipynb'
_input='sos_datatypes.ipynb'
    1348 sos_datatypes.ipynb

If you have multiple input files, you can pass them all together as a sos_targets with two file_target objects

In [2]:
step_input='sos_datatypes.ipynb' 'sos_magics.ipynb'
_input='sos_datatypes.ipynb' 'sos_magics.ipynb'
    1348 sos_datatypes.ipynb
    1285 sos_magics.ipynb

or separately as two groups of inputs:

In [3]:
step_input='sos_datatypes.ipynb' 'sos_magics.ipynb'
_input='sos_datatypes.ipynb'
    1348 sos_datatypes.ipynb

step_input='sos_datatypes.ipynb' 'sos_magics.ipynb'
_input='sos_magics.ipynb'
    1285 sos_magics.ipynb

In this case, the step input contains two file_target objects:

step_input = sos_targets('SoS_Syntax.ipynb', 'SoS_Magics.ipynb')

but the step process is executed twice, with

_input = sos_targets('SoS_Syntax.ipynb')
_input = sos_targets('SoS_Magics.ipynb')

respectively. Because _input contains only one element, it is not necessary to use _input[0] in the script.

A list of targets

The sos_targets type keeps a list of BaseTarget objects. It can be initialized from one or more str (for file_target) or from other targets. Lists of targets or dictionaries of targets (discussed later) are flattened and concatenated, so the end result is always a one-dimensional list.

A sos_targets object behaves like a sequence that can be sliced and iterated. For example, the following statement creates a sos_targets object with three filenames from a single filename and a list of two filenames:

In [4]:
Out[4]:
[file_target('a.txt'), file_target('b.txt'), file_target('c.txt')]

You can access one or more elements of a sos_targets, or iterate through it:

In [5]:
Out[5]:
file_target('c.txt')
In [6]:
Out[6]:
[file_target('b.txt'), file_target('c.txt')]
In [7]:
a.txt
b.txt
c.txt

To convert a sos_targets object to a regular list, you can use function list

In [8]:
Out[8]:
[file_target('a.txt'), file_target('b.txt'), file_target('c.txt')]

or take part of the sos_targets using a slice

In [9]:
Out[9]:
list

Named paths

Under the hood, paths are represented by type path (derived from pathlib.Path) and file targets are represented by type file_target, which is derived from path. Paths that start with ~ or # are expanded automatically:

  1. Paths that start with ~ are expanded with os.path.expanduser.
  2. Paths that start with #name are expanded according to the host on which the workflow is executed. The name should be defined in the host definition under the keys paths or shared.
In [1]:
path('~/a.txt')
/Users/bpeng/a.txt
In [2]:
path('#home/a.txt')
/Users/bpeng/a.txt

Now, if the same workflow is executed on docker, a remote host with a different #home, the output is different.

In [4]:
path('#home/a.txt')
/root/a.txt
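
The two expansion rules can be imitated in plain Python. The sketch below is not the SoS implementation: the function expand_named_path and the dictionary HOST_PATHS are hypothetical names, and the dictionary stands in for the paths section of a host definition.

```python
import os

# Hypothetical stand-in for the "paths" section of a host definition.
HOST_PATHS = {'home': '/Users/bpeng'}


def expand_named_path(p: str, host_paths: dict) -> str:
    """Expand a leading ~ or #name in a path string (illustration only)."""
    if p.startswith('~'):
        # rule 1: delegate to os.path.expanduser
        return os.path.expanduser(p)
    if p.startswith('#'):
        # rule 2: look the name up in the host-specific table
        name, sep, rest = p[1:].partition('/')
        if name not in host_paths:
            raise ValueError(f'undefined named path #{name}')
        return host_paths[name] + sep + rest
    return p


print(expand_named_path('#home/a.txt', HOST_PATHS))         # /Users/bpeng/a.txt
print(expand_named_path('#home/a.txt', {'home': '/root'}))  # /root/a.txt
```

The same path string resolves to different locations because only the lookup table changes from host to host.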

Format specification

sos_targets accepts a list of format options that render paths in different forms. Here is a summary of format options with their effects:

converter  operation         effect                         operand              output
a          absolute path     abspath()                      test.sos             /path/to/test.sos
b          base filename     basename()                     {home}/SoS/test.sos  test.sos
e          escape            replace(' ', '\\ ')            file 1.txt           file\ 1.txt
d          directory name    dirname() or '.'               /path/to/test.sos    /path/to
l          expand link       realpath()                     test.sos             /realpath/to/test.sos
n          remove extension  splitext()[0]                  /path/to/test.sos    /path/to/test
p          posix name        replace('\\', '/')...          c:\\Users            /c/Users
q          quote             quoted()                       file 1.txt           'file 1.txt'
r          repr              repr()                         file.txt             'file.txt'
s          str               str()                          file.txt             file.txt
U          undo expanduser   replace(expanduser('~'), '~')  /home/user/test.sos  ~/test.sos
x          file extension    splitext()[1]                  ~/SoS/test.sos       .sos
,          join with comma   ','.join()                     ['a.txt', 'b.txt']   a.txt,b.txt
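
Several rows of this table map directly onto functions in the Python standard library. The sketch below is not the SoS formatter: CONVERTERS and apply_converters are hypothetical names, shlex.quote stands in for quoted(), and applying the letters one after another from left to right is an assumption.

```python
import os
import shlex

# Hypothetical mapping from converter letters to standard-library calls.
CONVERTERS = {
    'a': os.path.abspath,
    'b': os.path.basename,
    'd': lambda p: os.path.dirname(p) or '.',
    'e': lambda p: p.replace(' ', '\\ '),
    'n': lambda p: os.path.splitext(p)[0],
    'q': shlex.quote,  # stand-in for quoted()
    'r': repr,
    's': str,
    'x': lambda p: os.path.splitext(p)[1],
}


def apply_converters(path: str, spec: str) -> str:
    """Apply each converter letter in spec to path, left to right."""
    for letter in spec:
        path = CONVERTERS[letter](path)
    return path


print(apply_converters('/path/to/test.sos', 'd'))   # /path/to
print(apply_converters('/path/to/test.sos', 'bn'))  # test
print(apply_converters('file 1.txt', 'q'))          # 'file 1.txt'
```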

These format options allow you to pass filenames to scripts in different formats. For example, it would be perfectly OK to pass ~/a.txt to a shell script, but a u formatter should be added if you are passing the filename to a script that does not understand ~ in filenames. For example,

In [10]:
> name: str of length 37
'/Users/bpeng1/sos/examples/update_toc'
> filename: str of length 14
'update_toc.sos'
> basefilename: str of length 10
'update_toc'
> expanded: str of length 41
'/Users/bpeng1/sos/examples/update_toc.sos'
> parparname: str of length 3
'sos'
> shortname: str of length 29
'~/sos/examples/update_toc.sos'

An important difference between the formatting of sos_targets and of regular lists of BaseTarget is that the format is applied to each item and the results are joined by a space or a comma. For example, whereas a regular list is formatted as a list

In [11]:
Out[11]:
"['a.txt', 'b.txt', 'c.txt']"

A sos_targets is formatted as

In [12]:
Out[12]:
'a.txt b.txt c.txt'

or separated by , with format option ","

In [13]:
Out[13]:
'a.txt,b.txt,c.txt'

or after formatting each element with specified formatter

In [14]:
Out[14]:
"'a.txt','b.txt','c.txt'"

sos_targets with a single target

One particular consequence of this format rule is that a sos_targets with only one element will be formatted exactly like a single target so you can use _input (a sos_targets) in place of _input[0] (a file_target) if you know there is only one target inside _input:

In [15]:
Out[15]:
'sos_datatypes.ipynb is the same as sos_datatypes.ipynb'
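
The format rule, and its consequence for a single target, can be reproduced with Python's __format__ protocol. This is a minimal sketch rather than the SoS class: TargetList is a hypothetical name and only the r, s and , options are handled.

```python
class TargetList:
    """Hypothetical list of file names that formats item by item."""

    def __init__(self, *names):
        self.names = list(names)

    def __format__(self, spec):
        # a ',' in the spec selects the separator; the rest applies per item
        sep = ',' if ',' in spec else ' '
        items = []
        for name in self.names:
            for letter in spec.replace(',', ''):
                if letter == 'r':
                    name = repr(name)
                elif letter == 's':
                    name = str(name)
                else:
                    raise ValueError(f'unsupported option {letter!r}')
            items.append(name)
        return sep.join(items)


targets = TargetList('a.txt', 'b.txt', 'c.txt')
print(f'{targets}')      # a.txt b.txt c.txt
print(f'{targets:,}')    # a.txt,b.txt,c.txt
print(f'{targets:r,}')   # 'a.txt','b.txt','c.txt'

single = TargetList('a.txt')
print(f'{single} is the same as {single.names[0]}')  # a.txt is the same as a.txt
```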

As a matter of fact, if a sos_targets has only one element, it passes unrecognized attributes and function calls to this element, as the following outputs show:

In [16]:
Out[16]:
'.ipynb'
In [17]:
Out[17]:
file_target('/Users/bpeng1/sos/sos-docs/src/user_guide/sos_datatypes.ipynb')
In [18]:
Out[18]:
30597

Basically, you can use _input exactly as _input[0] if there is only one file_target in _input.
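
Forwarding of this kind is commonly written with __getattr__, which Python calls only when normal attribute lookup fails. The sketch below assumes that mechanism for illustration; SingleOrMany is a hypothetical class built on pathlib and is not how sos_targets is necessarily implemented.

```python
from pathlib import PurePosixPath


class SingleOrMany:
    """Hypothetical container that forwards unknown attributes to its only item."""

    def __init__(self, *paths):
        self._targets = [PurePosixPath(p) for p in paths]

    def __getattr__(self, name):
        # called only when regular lookup fails; never forward private names
        if name.startswith('_'):
            raise AttributeError(name)
        if len(self._targets) != 1:
            raise AttributeError(
                f'cannot get {name!r} from {len(self._targets)} targets')
        return getattr(self._targets[0], name)


one = SingleOrMany('docs/sos_datatypes.ipynb')
print(one.suffix)  # .ipynb
print(one.name)    # sos_datatypes.ipynb
```

With two or more items the lookup raises AttributeError, because it would be ambiguous which element should answer.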

Attributes of target

Targets in sos_targets can be associated with arbitrary attributes. These attributes are usually assigned with option paired_with of an input statement.

Option paired_with accepts a dictionary and assigns attributes to each of the targets with specified values. For example,

In [19]:
A
B

Although targets and their attributes are usually set in an input statement, you can create targets and set attributes directly. For example

In [20]:
A
A

Here the target.set(name, value) function sets an attribute of the target, and target.get(name, default=None) gets the value of attribute name, returning default if name is not a valid attribute. It is therefore a safer way to retrieve an attribute than target.name if you are uncertain whether attribute name exists for target.
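
The difference between target.get(name) and target.name can be shown with a small stand-in class. Target below is hypothetical and only mirrors the set and get signatures described above; it is not the SoS BaseTarget.

```python
class Target:
    """Hypothetical target that carries arbitrary named attributes."""

    def __init__(self, filename):
        self._filename = filename
        self._attrs = {}

    def set(self, name, value):
        self._attrs[name] = value

    def get(self, name, default=None):
        # never raises: falls back to default for unknown names
        return self._attrs.get(name, default)

    def __getattr__(self, name):
        # attribute-style access raises for names that were never set
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._attrs[name]
        except KeyError:
            raise AttributeError(name) from None


target = Target('a.txt')
target.set('sample', 'A')
print(target.sample)              # A
print(target.get('sample'))       # A
print(target.get('batch', 'NA'))  # NA
```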

Labels of targets

Each target in a sos_targets has an attribute label, which corresponds to the step in which the target is specified (input) or generated (output). For example, the label of a sos_targets that is directly specified in a step is the name of the step.

In [21]:
['step_10']

If you have multiple inputs, you can separate them into different groups using keyword arguments

In [22]:
a.bam b.bam a.bai a.bai
['bam', 'bam', 'bai', 'bai']

If the input target is inherited from another step, the source will be the name of that step.

In [23]:
[##] 2 steps processed (2 jobs completed)

In a more complex case when the source comes from multiple input steps and the present step, the labels attribute points out the source of each target:

In [24]:
[###] 3 steps processed (3 jobs completed)

The use of a keyword argument, however, overrides the default source

In [25]:
c.txt a.txt b.txt
['step_30', 'prev', 'prev']

The source information can be used to select subsets of targets according to their labels. For example, _input['prev'] would generate a sos_targets with all targets from source prev.

In [26]:
a.txt
['step_10']
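
Selecting by label amounts to filtering targets against a parallel list of labels. The sketch below uses plain lists and a hypothetical helper select_by_label; the file names and labels are taken from the output shown earlier in this section.

```python
from typing import List


def select_by_label(targets: List[str], labels: List[str], label: str) -> List[str]:
    """Return the targets whose label equals the requested one."""
    return [t for t, lab in zip(targets, labels) if lab == label]


targets = ['c.txt', 'a.txt', 'b.txt']
labels = ['step_30', 'prev', 'prev']

print(select_by_label(targets, labels, 'prev'))     # ['a.txt', 'b.txt']
print(select_by_label(targets, labels, 'step_30'))  # ['c.txt']
```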

groups of sos_targets

As we have seen, targets in a sos_targets can be grouped in many ways and _input contains subsets of the targets and is the input for each substep. For example, in the following example, the 4 input files are grouped into two groups of the same size. The step is executed twice, each time for a different group. step_input.groups contains a list of sos_targets that becomes _input of the substep.

In [27]:
Group 0
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
a.txt b.txt

Group 1
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
c.txt d.txt
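
The grouping in this example is consistent with splitting the input into consecutive chunks of a fixed size. A minimal sketch of that idea, with the hypothetical helper group_by_size standing in for the grouping option of the input statement:

```python
from typing import List


def group_by_size(targets: List[str], size: int) -> List[List[str]]:
    """Split targets into consecutive groups of at most `size` items."""
    return [targets[i:i + size] for i in range(0, len(targets), size)]


step_input = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
groups = group_by_size(step_input, 2)

# each group plays the role of _input for one substep
for index, _input in enumerate(groups):
    print(f'Group {index}: {" ".join(_input)}')
# Group 0: a.txt b.txt
# Group 1: c.txt d.txt
```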

You usually do not need to access groups of sos_targets directly but knowing the existence of groups would help you understand how groups are passed from one step to another.

For example, in the following workflow, when step 10 obtains output_from step A, it obtains a step_output with 4 groups, which then becomes the _input of each substep of step 10.

In [28]:
test_0.txt test_1.txt test_2.txt test_3.txt
test_0.txt
test_0.txt test_1.txt test_2.txt test_3.txt
test_1.txt
test_0.txt test_1.txt test_2.txt test_3.txt
test_2.txt
test_0.txt test_1.txt test_2.txt test_3.txt
test_3.txt

zap file targets

sos_targets provides the zap() function, which zaps all file targets in the list. This technique is usually used to remove large intermediate files during the execution of the workflow. For example, if you have a workflow that downloads and processes large files, you can do something like

[download: provides='{file}.fastq']
download: expand=True
    http://some_url/{file}.fastq

[default]
input: [f'{x}.fastq' for x in range(1000)], group_by=1
output: _input.with_suffix('.bam')
sh: expand=True
   process _input to _output
  
_input.zap()

In this example, 1000 fastq files are downloaded and processed, but the input files are zapped after they are processed. Although the files have been removed, re-running the workflow will not download and process the files again because the downloaded files are still considered to exist by SoS.
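
One way to picture zapping is to replace a file with a small record of what it contained, so that a later existence check can still succeed. The sketch below rests on that assumption only: the .zapped suffix, the JSON record, and the helper names zap and exists_or_zapped are hypothetical and may differ from what SoS actually stores.

```python
import hashlib
import json
import os
import tempfile


def zap(path: str) -> str:
    """Replace a file with a small record of its size and checksum."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # read in chunks so that large files are not loaded into memory
        for chunk in iter(lambda: handle.read(1 << 20), b''):
            digest.update(chunk)
    record = {'size': os.path.getsize(path), 'sha256': digest.hexdigest()}
    marker = path + '.zapped'  # hypothetical marker file name
    with open(marker, 'w') as handle:
        json.dump(record, handle)
    os.remove(path)
    return marker


def exists_or_zapped(path: str) -> bool:
    """Treat a zapped file as if it still existed."""
    return os.path.exists(path) or os.path.exists(path + '.zapped')


with tempfile.TemporaryDirectory() as workdir:
    fastq = os.path.join(workdir, '0.fastq')
    with open(fastq, 'w') as handle:
        handle.write('ACGT\n')
    zap(fastq)
    print(os.path.exists(fastq))    # False
    print(exists_or_zapped(fastq))  # True
```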