SoS targets

Difficulty level: easy
Time needed to learn: 10 minutes or less
Key points:
- Targets are objects that can be created and detected. They are passed to and from SoS steps as variables _input, _output, step_input and step_output.
- file_target represents disk files and is the most common target type used in SoS.

Targets are objects that a SoS step can take as input, produce as output, or depend on. They are usually files that are represented by filenames, but can also be other types of targets.

Target `file_target`

Targets of type file_target represent files on a file system. The type file_target is usually not used explicitly because SoS treats string-type targets as file_target. A file_target

- is derived from Python pathlib.Path
- automatically expands the user directory in paths starting with ~
- allows you to extend the path with a + operation
- has a special zap operation to replace (large) files with their signatures
- accepts a list of format options to easily format the path in different formats
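Because file_target is derived from pathlib.Path, the attributes it inherits can be tried out with plain pathlib. The following is a minimal sketch that uses PurePosixPath as a stand-in for file_target; the path itself is made up, and unlike file_target, pathlib only expands ~ when expanduser() is called explicitly.

```python
from pathlib import Path, PurePosixPath

# Stand-in for a file_target: a pure POSIX path, so results do not depend on the OS.
p = PurePosixPath('/path/to/user_guide/targets.ipynb')

name = p.name                          # final path component
parent = p.parent                      # containing directory
suffix = p.suffix                      # extension, including the dot
html = p.with_suffix('.html')          # same path with another extension
sibling = p.parent / 'something.txt'   # join another piece onto the directory

# pathlib needs an explicit call to expand "~"; file_target is described as doing it automatically.
expanded = Path('~/a.txt').expanduser()
```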

Attributes and functions

file_target is derived from pathlib.Path but has a few additional features. It first automatically expands ~ if it is the first character of the path.

'targets.ipynb'

file_target('/Users/bpeng1/sos/sos-docs/src/user_guide')

file_target('/Users/bpeng1/sos/sos-docs/src/user_guide/something.txt')

'.ipynb'

('/',
 'Users',
 'bpeng1',
 'sos',
 'sos-docs',
 'src',
 'user_guide',
 'targets.ipynb')

True

False

file_target('/Users/bpeng1/sos/sos-docs/src/user_guide/sos_datatypes.html')

file_target('/Users/bpeng1/sos/sos-docs/src/user_guide/targets.html')

and you can evaluate file_target in format strings as

'Hello My name is targets.ipynb'

Note that file_target offers an os.PathLike interface and can be used directly with os.path functions such as

123983
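The os.PathLike behavior can be illustrated with a plain pathlib.Path, which implements the same interface. This is only a sketch on a temporary file created for the purpose; it does not use file_target itself.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'data.txt'
    target.write_bytes(b'hello')

    size = os.path.getsize(target)    # os.path accepts the path object directly
    is_file = os.path.isfile(target)
    as_str = os.fspath(target)        # the string that os functions actually receive
```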

The following is an example of using with_suffix on _input, which is of type sos_targets but in this case acts just like a file_target

Generating test1.bai from test1.bam
Generating test2.bai from test2.bam

Operator `+` for `file_target`

file_target allows you to extend a file_target with a str or path using the + operator. For example, with

p + '.tmp' returns a path with .tmp appended to the path

file_target('test.txt.tmp')

which is different from the / operator, which joins the operand as another piece of the path

file_target('test.txt/.tmp')

A note of caution, however, is that file_target strips the trailing slash from an input path

file_target('/path/to')

The result of the following can be surprising

file_target('/path/toa.txt')

so the rule of thumb is that you should use / to extend path and + to extend name, as in

file_target('/path/to/dir/filename.ext')
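The difference between the two operators can be reproduced with plain pathlib. In this sketch, append_to_name is a hypothetical helper that imitates the + operator by string concatenation; it is not part of SoS or pathlib.

```python
from pathlib import PurePosixPath


def append_to_name(path, text):
    """Hypothetical stand-in for `+`: glue text onto the last path component."""
    return PurePosixPath(str(path) + text)


d = PurePosixPath('/path/to/')           # the trailing slash is dropped
surprising = append_to_name(d, 'a.txt')  # no separator is inserted
joined = d / 'a.txt'                     # `/` adds a new path component
full = append_to_name(d / 'dir' / 'filename', '.ext')
```

Here `/` builds the directory structure and the concatenation only extends the final name, which is the rule of thumb stated above.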

Because sos_targets with a single file_target will act like a file_target, the following is an example of using + directly on _input to generate _output of a substep

Generating test1.bam.bai from test1.bam
Generating test2.bam.bai from test2.bam

Format specification

file_target accepts a list of format options to easily format path in different formats:

converter	operation	effect	operand	output
`a`	absolute path	`abspath()`	`test.sos`	`/path/to/test.sos`
`b`	base filename	`basename()`	`{home}/SoS/test.sos`	`test.sos`
`e`	escape	`replace(' ', '\\ ')`	`file 1.txt`	`file\ 1.txt`
`d`	directory name	`dirname()` or `'.'`	`/path/to/test.sos`	`/path/to`
`l`	expand link	`realpath()`	`test.sos`	`/realpath/to/test.sos`
`n`	remove extension	`splitext()[0]`	`/path/to/test.sos`	`/path/to/test`
`p`	posix name	`replace('\\', '/')...`	`c:\\Users`	`/c/Users`
`q`	quote	`quoted()`	`file 1.txt`	`'file 1.txt'`
`r`	repr	`repr()`	`file.txt`	`'file.txt'`
`s`	str	`str()`	`file.txt`	`file.txt`
`U`	undo expanduser	`replace(expanduser('~'), '~')`	`/home/user/test.sos`	`~/test.sos`
`x`	file extension	`splitext()[1]`	`~/SoS/test.sos`	`.sos`
`,`	join with comma	`','.join()`	`['a.txt', 'b.txt']`	`a.txt,b.txt`

These format options allow you to pass filenames to scripts in different formats. For example, it would be perfectly OK to pass ~/a.txt to a shell script, but a u formatter should be added if you are passing the filename to a script that does not understand ~ in filenames. For example,

'/Users/bpeng1/sos/examples/update_toc'

'update_toc.sos'

'update_toc'

'/Users/bpeng1/sos/examples/update_toc.sos'

'sos'

'~/sos/examples/update_toc.sos'

The last example is interesting because it applies three converters and gets the name of the grandparent directory using the equivalent of basename(dirname(dirname(file))).
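Chaining converters can be pictured as applying string functions from left to right. The sketch below reimplements a few rows of the table with posixpath and shlex; the CONVERTERS dictionary, the apply_converters function and the example path are assumptions made for illustration, not SoS code.

```python
import posixpath
import shlex

# A subset of the converters in the table, written as plain string functions.
CONVERTERS = {
    'b': posixpath.basename,
    'd': lambda p: posixpath.dirname(p) or '.',
    'n': lambda p: posixpath.splitext(p)[0],
    'x': lambda p: posixpath.splitext(p)[1],
    'e': lambda p: p.replace(' ', '\\ '),
    'q': shlex.quote,
}


def apply_converters(path, spec):
    """Apply each converter letter in `spec` from left to right."""
    for letter in spec:
        path = CONVERTERS[letter](path)
    return path


example = '/home/user/sos/examples/update_toc.sos'
grandparent = apply_converters(example, 'ddb')  # basename(dirname(dirname(path)))
stem = apply_converters(example, 'bn')          # base name without extension
```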

Finally, path first applies these path converters and then formats the resulting string with any remaining standard format specification. For example, you can format the path object as a regular string

'         /Users/bpeng1/sos/examples/update_toc.sos'

or apply path converters (bn for the base name without extension) and then format the result as a regular string.

'                                        update_toc'
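One way to get this two-stage behavior in Python is a `__format__` method that consumes leading converter letters and hands the rest of the specification to ordinary string formatting. The class below is a hypothetical sketch of that idea, with only three converters; how SoS actually splits the specification is an assumption here.

```python
import posixpath


class FormattablePath(str):
    """Hypothetical path-like string whose format spec may start with converters."""

    CONVERTERS = {
        'b': posixpath.basename,
        'd': lambda p: posixpath.dirname(p) or '.',
        'n': lambda p: posixpath.splitext(p)[0],
    }

    def __format__(self, spec):
        text = str(self)
        for i, letter in enumerate(spec):
            if letter not in self.CONVERTERS:
                # Everything from here on is a regular string format spec, e.g. '>20'.
                return format(text, spec[i:])
            text = self.CONVERTERS[letter](text)
        return text


p = FormattablePath('/home/user/sos/examples/update_toc.sos')
padded_full = f'{p:>50}'   # no converters, right-aligned in 50 columns
padded_stem = f'{p:bn>20}' # base name, extension removed, then right-aligned
```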

The following example uses :n to remove the extension of _input and then adds a new suffix. In this case it is the same as _input.with_suffix('.bam.bai'), so which style to use is often a matter of personal preference.

Generating test1.bam.bai from test1.bam
Generating test2.bam.bai from test2.bam

`zap()` function

Another addition of the path type is a zap() function that removes the file and creates a {filename}.zapped file with file signatures. This .zapped file is considered to be "existent" by the runtime signature system, so a workflow step will not be repeated if some of its input or output files are zapped, unless the actual files are needed. This function is usually used as sos_targets.zap() to zap all input or output files, and will be demonstrated in detail in SoS datatypes.
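The idea of zapping can be sketched in a few lines of standard Python. The zap function below is a hypothetical illustration: the content of the .zapped file (size and SHA-256 checksum on two lines) is an assumption, not the signature format SoS writes.

```python
import hashlib
import os
import tempfile


def zap(filename):
    """Replace `filename` with `filename.zapped` holding its size and checksum.

    The two-line record format is an assumption made for this sketch.
    """
    with open(filename, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    size = os.path.getsize(filename)
    with open(filename + '.zapped', 'w') as f:
        f.write(f'{size}\n{digest}\n')
    os.remove(filename)
    return filename + '.zapped'


with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, 'big.dat')
    with open(big, 'wb') as f:
        f.write(b'x' * 1000)
    zapped = zap(big)
    original_gone = not os.path.exists(big)
    with open(zapped) as f:
        record = f.read().split()
```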

Target `executable`

executable targets are commands that should be accessible and executable by SoS. These targets are usually listed in the depends section of a SoS step. For example, SoS would stop if a command fastqc is not found.

ERROR: No rule to generate target 'executable("some_command")', needed by '10'.

RuntimeError: Workflow exited with code 1

An executable target can also be the output of a step, but installing executables can be tricky because the commands must be installed to a directory that is already in $PATH so that they are immediately accessible to SoS. Because SoS automatically adds ~/.sos/bin to $PATH (option -b), an environment-neutral way for on-the-fly installation is to install commands to this directory. For example

You can also have finer control over which version of the command is eligible by checking the output of commands. The trick here is to provide a complete command and one or more version strings as the string that should appear in the output of the command.

For example, command python --version is executed in the following example to check if the output contains string 5.18. The step would only be executed if the right version exists.

[#] 1 step processed (1 job completed)

If no version string is provided, SoS will only check the existence of the command and not actually execute the command.
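The two checks described above (existence only, or existence plus a version string in the command output) can be sketched with the standard library. The function command_available is hypothetical and only mirrors the described behavior; it is not how SoS implements the executable target.

```python
import shlex
import shutil
import subprocess
import sys


def command_available(command, versions=()):
    """Return True if the command exists and, when versions are given,
    its output contains at least one of the version strings."""
    args = shlex.split(command)
    if shutil.which(args[0]) is None:
        return False
    if not versions:
        return True  # existence check only; the command is not executed
    result = subprocess.run(args, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return any(v in output for v in versions)


# Use the running interpreter so that the example works on any machine.
python_cmd = shlex.join([sys.executable, '--version'])
this_version = '%d.%d' % sys.version_info[:2]
found = command_available(python_cmd, [this_version])
```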

Target `sos_variable`

sos_variable(name) targets represent SoS variables that are created by a SoS step and shared with other steps. These targets can be used to provide information to other steps. For example,

There are 100 objects

Step 100 needs some information extracted from the output of another step (step 10). You can either parse the information in step 100 or use another step to provide the information. The latter is recommended because the information could be requested by multiple steps. Note that counts is an auxiliary step that provides sos_variable('counts') through its shared section option.

Target `env_variable`

SoS keeps track of the runtime environment and creates signatures of executed steps so that they do not have to be executed again. Some commands, especially shell scripts, could however behave differently with different environment variables. To make sure a step is re-executed when the environment changes, you should list the variables that affect the output of these commands as dependencies of the step. For example

ERROR: No rule to generate target 'env_variable("DEBUG")', needed by '10'.

RuntimeError: Workflow exited with code 1
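The reason for listing environment variables as dependencies can be shown with a toy signature. In this sketch, step_signature is a hypothetical function that hashes a script together with the values of the listed variables; how SoS actually computes signatures is not described here, and the variable name SOS_SKETCH_DEBUG is made up.

```python
import hashlib
import os


def step_signature(script, env_names):
    """Hash the script together with the values of the listed variables."""
    h = hashlib.sha256(script.encode())
    for name in sorted(env_names):
        value = os.environ.get(name)
        if value is None:
            # Mirrors the error above: an undefined variable cannot be depended on.
            raise KeyError(f'environment variable {name} is not defined')
        h.update(f'{name}={value}'.encode())
    return h.hexdigest()


script = 'echo "running"'
os.environ['SOS_SKETCH_DEBUG'] = '0'
sig_off = step_signature(script, ['SOS_SKETCH_DEBUG'])
os.environ['SOS_SKETCH_DEBUG'] = '1'
sig_on = step_signature(script, ['SOS_SKETCH_DEBUG'])
changed = sig_off != sig_on   # a different value yields a different signature
```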

Target `sos_step`

The sos_step target represents, needless to say, a SoS step. This target provides a straightforward method to specify step dependencies. For example,

Initialize
I am 10

What is more interesting, however, is that sos_step('a') also matches steps such as a_1 and a_2, so the step will depend on the execution of the entire workflow a.
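The name matching described here can be pictured with a small regular expression. The function matching_steps is a hypothetical sketch of the rule "the name itself, or the name followed by an underscore and an index"; the exact matching rule used by SoS is assumed.

```python
import re


def matching_steps(target, step_names):
    """Return the step names that a target like sos_step(target) would cover."""
    pattern = re.compile(re.escape(target) + r'(_\d+)?')
    return [name for name in step_names if pattern.fullmatch(name)]


steps = ['a_1', 'a_2', 'ab_1', 'b_1', 'a']
covered = matching_steps('a', steps)
```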

For example, in the following workflow, step default depends on sos_step('work'), which triggers a process-oriented workflow work with steps work_1 and work_2.

This example is similar to the following workflow that uses a subworkflow (sos_run('work')), but as you can see from the generated DAG, the execution logic of the two is quite different. More specifically, the sos_step() target adds a subworkflow to the master DAG, while sos_run triggers a separate DAG.

Target `dynamic`

A dynamic target is a target that can only be determined when the step is actually executed.

For example,

Last output is

To address this problem, you should try to expand the output file after the completion of the step, using a dynamic target.

Last output is a.bat
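The timing issue behind dynamic targets can be reproduced in plain Python: a wildcard pattern expanded before the files exist matches nothing, while the same pattern expanded afterwards finds them. This is only an analogy with glob on a temporary directory; the file name a.out is made up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, '*.out')

    before = glob.glob(pattern)                     # expanded before the "step" runs
    open(os.path.join(tmp, 'a.out'), 'w').close()   # the "step" creates its output
    after = [os.path.basename(f) for f in glob.glob(pattern)]  # expanded afterwards
```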

Please refer to chapter SoS Step for details of such targets.

Target `remote`

A target that is marked as remote is instantiated only when it is executed by a task. Please check section Remote Execution for details.

Target `system_resource`

Target system_resource checks the available system resources and is available only if the system has enough memory and/or disk space for the workflow step. For example, the following step would generate an error if the system does not have at least 16G of RAM and 1T of disk space on the volume of the current project directory.

ERROR: No rule to generate target 'system_resource(mem='16G',disk='1T')', needed by '10'.

RuntimeError: Workflow exited with code 1
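A disk-space check of this kind can be sketched with shutil.disk_usage. Both functions below are hypothetical, and treating G and T as powers of 1024 is an assumption; a memory check is left out because it would need a third-party package.

```python
import shutil

# Assumption: units are binary multiples (1G = 1024**3 bytes).
UNITS = {'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}


def parse_size(text):
    """Convert a size such as '16G' or '1T' into a number of bytes."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def enough_disk(required, path='.'):
    """Return True if the volume holding `path` has at least `required` free."""
    return shutil.disk_usage(path).free >= parse_size(required)


needed = parse_size('1T')
has_space = enough_disk('1T')   # True or False depending on the machine
```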

Target `R_library`

The R_library target represents an R library. If the library is not available, SoS will try to install it from CRAN, Bioconductor, or GitHub. A GitHub package name should be formatted as pkg@path. A typical usage of this target would be

null device 
          1

R_library can also be used to check for specific versions of packages. For example:

R_library('edgeR', '3.12.0')

will result in a warning if edgeR version is not 3.12.0. You can specify multiple versions

R_library('edgeR', ['3.12.0', '3.12.1'])

certain version or newer,

R_library('edgeR', '>=3.12.0')

certain version or older

R_library('ggplot2', '<1.0.0')

The default R library repo is http://cran.us.r-project.org. It is possible to customize the repo from which an R library would be installed, for example:

R_library('Rmosek', repos = "http://download.mosek.com/R/7")

To install from a github repository:

R_library('varbvs@pcarbo/varbvs/varbvs-R')

where varbvs is package name, pcarbo/varbvs/varbvs-R corresponds to sub-directory varbvs-R in repository https://github.com/pcarbo/varbvs.

Target `Py_Module`

This target is usually used in the depends statement of a SoS step to specify a required Python module. For example,

-----  ------  -------------
Sun    696000     1.9891e+09
Earth    6371  5973.6
Moon     1737    73.5
Mars     3390   641.85
-----  ------  -------------

If a module is not available, with autoinstall=True SoS will try to execute command pip install to install it, which might or might not succeed depending on your system configuration. For example,

Py_Module('numpy', autoinstall=True)

To specify version,

Py_Module('numpy', version=">=1.14.0")

Or a shorthand syntax,

Py_Module('numpy>1.14.0')
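The shorthand syntax combines a module name, a comparison operator and a version in one string. The sketch below shows one way such a string could be split and compared; parse_requirement and satisfies are hypothetical helpers that only handle purely numeric dotted versions, and they do not reflect how SoS parses the string.

```python
import operator
import re

OPS = {'>=': operator.ge, '<=': operator.le, '==': operator.eq,
       '>': operator.gt, '<': operator.lt}


def parse_requirement(text):
    """Split 'numpy>1.14.0' into ('numpy', '>', '1.14.0'); the version is optional."""
    m = re.fullmatch(r'([\w.\-]+)\s*(?:(>=|<=|==|>|<)\s*([\d.]+))?', text.strip())
    if m is None:
        raise ValueError(f'cannot parse requirement: {text!r}')
    return m.group(1), m.group(2), m.group(3)


def as_tuple(version):
    """Simplification: only numeric dotted versions such as '1.14.0'."""
    return tuple(int(part) for part in version.split('.'))


def satisfies(installed, op, required):
    """Compare two numeric dotted versions with the given operator."""
    return OPS[op](as_tuple(installed), as_tuple(required))


name, op, required = parse_requirement('numpy>1.14.0')
ok = satisfies('1.15.2', op, required)
```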
