- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
- Targets are objects that can be created and detected. They are passed to and from SoS steps as variables `_input`, `_output`, `step_input`, and `step_output`. `file_target` represents disk files and is the most common target used in SoS.
Targets are objects that a SoS step can take as input, produce as output, or depend on. They are usually files that are represented by filenames, but can also be other types of targets.
Targets of type `file_target` represent files on a file system. The type `file_target` is usually not used explicitly because SoS treats string-type targets as `file_target`.
`file_target`
- is derived from Python `pathlib.Path`
- automatically expands user from paths starting with `~`
- allows you to extend `path` with a `+` operation
- has a special `zap` operation to replace (large) files with their signatures
- accepts a list of format options to easily format paths in different formats
`file_target` is derived from `pathlib.Path` but has a few additional features. First, it automatically expands `~` if it is the first character of the path.
You can also evaluate a `file_target` in format strings. Note that `file_target` offers an `os.PathLike` interface and can be used directly with `os.path` functions.
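Because `file_target` is described as a `pathlib.Path` subclass, the behaviors above can be approximated with the standard library alone. The following is a minimal sketch using plain `pathlib.Path`, not `file_target` itself; the path `~/project/data.txt` is a made-up example, and unlike `file_target`, plain `pathlib` needs an explicit `expanduser()` call.

```python
import os
from pathlib import Path

# Plain pathlib needs an explicit expanduser() call; the text says
# file_target performs this expansion automatically.
p = Path("~/project/data.txt").expanduser()

# A Path can be interpolated into a format string like any other object.
message = f"Reading {p}"

# Path implements os.PathLike, so os.path functions accept it directly.
name = os.path.basename(p)
stem, ext = os.path.splitext(name)
```

Here `name` is `data.txt` and `ext` is `.txt`, obtained without converting the path to a string first.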
The following is an example of using `with_suffix` on `_input`, which is of type `sos_targets` but in this case acts just like a `file_target`.
`p + '.tmp'` returns a `path` with `.tmp` appended to the path, which is different from the `/` operator that joins the operand as another piece of the path. A note of caution, however: because `file_target` strips the trailing slash from an input path, applying `+` to a directory path can give a surprising result. The rule of thumb is that you should use `/` to extend a path and `+` to extend a name.
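The difference between `+` and `/`, and the trailing-slash pitfall, can be illustrated outside SoS. The sketch below is an emulation built on `pathlib.PurePosixPath`; the class `NamePath` is hypothetical and is not the SoS implementation. It only assumes that `+` concatenates text onto the string form of the path, as the text describes.

```python
from pathlib import PurePosixPath

class NamePath(PurePosixPath):
    """Hypothetical path type whose `+` appends text to the final name."""

    def __add__(self, suffix):
        # Concatenate on the string form, then rebuild a path object.
        return type(self)(str(self) + suffix)

p = NamePath("/data/sample.txt")
tmp = p + ".tmp"            # extends the name
child = p.parent / "out"    # adds a new path component

# pathlib also drops a trailing slash, so `+` glues the text onto the
# directory name instead of creating a file inside the directory.
folder = NamePath("/data/results/")
glued = folder + "a.txt"
inside = folder / "a.txt"
```

In this sketch `glued` is `/data/resultsa.txt` while `inside` is `/data/results/a.txt`, which is why `/` is the safer choice for adding path components.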
Because `sos_targets` with a single `file_target` acts like a `file_target`, the following is an example of using `+` directly on `_input` to generate the `_output` of a substep.
`file_target` accepts a list of format options to easily format paths in different formats:
converter | operation | effect | operand | output |
---|---|---|---|---|
a | absolute path | abspath() | test.sos | /path/to/test.sos |
b | base filename | basename() | {home}/SoS/test.sos | test.sos |
e | escape | replace(' ', '\\ ') | file 1.txt | file\ 1.txt |
d | directory name | dirname() or '.' | /path/to/test.sos | /path/to |
l | expand link | realpath() | test.sos | /realpath/to/test.sos |
n | remove extension | splitext()[0] | /path/to/test.sos | /path/to/test |
p | posix name | replace('\\', '/')... | c:\\Users | /c/Users |
q | quote | quoted() | file 1.txt | 'file 1.txt' |
r | repr | repr() | file.txt | 'file.txt' |
s | str | str() | file.txt | file.txt |
U | undo expanduser | replace(expanduser('~'), '~') | /home/user/test.sos | ~/test.sos |
x | file extension | splitext()[1] | ~/SoS/test.sos | .sos |
, | join with comma | ','.join() | ['a.txt', 'b.txt'] | a.txt,b.txt |
These format options allow you to pass filenames to scripts in different formats. For example, it would be perfectly OK to pass `~/a.txt` to a shell script, but a `u` formatter should be added if you are passing the filename to a script that does not understand `~` in filenames. For example,
The last example is pretty interesting because it applies three converters and gets the name of the grandparent directory using an equivalent of `basename(dirname(dirname(file)))`.
Finally, `path` formats the object with these format operators and then formats the resulting string with any additional formatters. For example, you can format the `path` object as a regular string, or apply a `path` formatter (`bn` for base name of filename) and then format the result as a regular string.
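To make the two-stage formatting concrete, here is a small emulation in plain Python. It is a sketch, not the SoS implementation: the names `CONVERTERS` and `format_path` are hypothetical, only a few converters from the table are reproduced (with `posixpath` so results do not depend on the platform), and it assumes converters are applied left to right until the first character that is not a converter.

```python
import posixpath

# Emulated subset of the converters in the table above (POSIX paths only).
CONVERTERS = {
    "b": posixpath.basename,
    "d": lambda p: posixpath.dirname(p) or ".",
    "n": lambda p: posixpath.splitext(p)[0],
    "x": lambda p: posixpath.splitext(p)[1],
    "e": lambda p: p.replace(" ", "\\ "),
}

def format_path(path, spec):
    """Apply converter characters one by one, then hand the rest to str formatting."""
    result = str(path)
    for index, char in enumerate(spec):
        if char in CONVERTERS:
            result = CONVERTERS[char](result)
        else:
            # Assumption: the first non-converter character starts a regular
            # string format spec such as '>10'.
            return format(result, spec[index:])
    return result
```

Under these assumptions, `format_path("/path/to/test.sos", "bn")` gives `test`, and `format_path("/path/to/test.sos", "ddb")` gives `path`, the name of the grandparent directory.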
The following example uses `:n` to get the name of `_input` without its extension and adds a new suffix. In this case it is the same as `_input.with_suffix('.bam.bai')`, so it is often a personal preference which style to use.
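The equivalence of the two styles can be checked with the standard library. This sketch uses `pathlib.PurePosixPath` and `posixpath.splitext` as stand-ins for `_input` and the `:n` converter; the file name `/data/sample.bam` is a made-up example.

```python
import posixpath
from pathlib import PurePosixPath

bam = PurePosixPath("/data/sample.bam")

# Style 1: strip the extension, then append the new one.
via_splitext = posixpath.splitext(str(bam))[0] + ".bam.bai"

# Style 2: let pathlib replace the last suffix in one call.
via_with_suffix = str(bam.with_suffix(".bam.bai"))
```

Both expressions produce `/data/sample.bam.bai`.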
Another addition of the `path` type is a `zap()` function that removes the file and creates a `{filename}.zapped` file with file signatures. This `.zapped` file is considered to be "existent" by the runtime signature system, so a workflow step will not be repeated if some of its input or output files are zapped, unless the actual files are needed. This function is usually used as `sos_targets.zap()` to zap all input or output files, and will be demonstrated in detail in SoS datatypes.
`executable` targets are commands that should be accessible and executable by SoS. These targets are usually listed in the `depends` section of a SoS step. For example, SoS would stop if a command `fastqc` is not found.
An `executable` target can also be the output of a step, but installing executables can be tricky because the commands should be installed to a directory in the existing `$PATH` so that they are immediately accessible by SoS. Because SoS automatically adds `~/.sos/bin` to `$PATH` (option `-b`), an environment-neutral way for on-the-fly installation is to install commands to this directory. For example
You can also have finer control over which version of the command is eligible by checking the output of commands. The trick here is to provide a complete command and one or more version strings as the string that should appear in the output of the command.
For example, command `python --version` is executed in the following example to check if the output contains string `5.18`. The step would only be executed if the right version exists.
If no version string is provided, SoS will only check the existence of the command and not actually execute the command.
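The check described above (existence first, then an optional search for version strings in the command output) can be approximated in plain Python. This is an illustrative sketch only: `check_executable` is a hypothetical helper, not part of SoS, and searching both stdout and stderr is an assumption of the sketch.

```python
import shlex
import shutil
import subprocess

def check_executable(command, versions=()):
    """Return True if the command exists and, when version strings are given,
    its output contains at least one of them."""
    args = shlex.split(command)
    if shutil.which(args[0]) is None:
        return False
    if not versions:
        # No version requested: existence is enough, nothing is executed.
        return True
    done = subprocess.run(args, capture_output=True, text=True, check=False)
    output = done.stdout + done.stderr
    return any(version in output for version in versions)
```

A call such as `check_executable("fastqc")` only looks the command up on `$PATH`, while passing one or more version strings causes the full command to be run and its output searched.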
Step `100` needs some information extracted from the output of another step (step `10`). You can either parse the information in step `100` or use another step to provide the information. The latter is recommended because the information could be requested by multiple steps. Note that `counts` is an auxiliary step that provides `sos_variable('counts')` through its `shared` section option.
SoS keeps track of the runtime environment and creates signatures of executed steps so that they do not have to be executed again. Some commands, especially shell scripts, could however behave differently with different environment variables. To make sure a step is re-executed when the environment changes, you should list the variables that affect the output of these commands as dependencies of the step. For example
The `sos_step` target represents, needless to say, a SoS step. This target provides a straightforward method to specify step dependencies. For example,
What is more interesting, however, is that `sos_step('a')` matches steps such as `a_1` and `a_2`, so the step will depend on the execution of the entire workflow.
For example, in the following workflow, step `default` depends on `sos_step('work')`, which triggers a process-oriented workflow `work` with steps `work_1` and `work_2`.
This example is similar to the following workflow that uses a subworkflow (`sos_run('work')`), but as you can see from the generated DAG, the execution logic of the two is quite different. More specifically, the `sos_step()` target adds a subworkflow to the master DAG, while `sos_run` triggers a separate DAG.
To address this problem, you should try to expand the output file after the completion of the step, using a `dynamic` target.
Please refer to chapter SoS Step for details of such targets.
A target that is marked as `remote` is instantiated only when it is executed by a task. Please check section Remote Execution for details.
Target `system_resource` checks the available system resources and is available only if the system has enough memory and/or disk space for the workflow step. For example, the following step would generate an error if the system does not have at least `16G` of RAM and `1T` of disk space on the volume of the current project directory.
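The disk-space half of such a check can be sketched with the standard library. This is not how SoS implements `system_resource`; `parse_size` and `has_disk_space` are hypothetical helpers, and treating `G` and `T` as binary multiples is an assumption of the sketch. A memory check is left out because the Python standard library has no portable call for it (a third-party package such as `psutil` would be needed).

```python
import shutil

# Assumption of this sketch: suffixes are binary multiples (1K = 1024 bytes).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Convert a size such as '16G' or '1T' into a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)

def has_disk_space(path, required):
    """Return True if the volume holding `path` has at least `required` free."""
    return shutil.disk_usage(path).free >= parse_size(required)
```

For instance, `has_disk_space(".", "1T")` reports whether the volume of the current directory has one tebibyte free under the unit convention assumed here.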
The `R_library` target represents an R library. If the library is not available, SoS will try to install it from CRAN, Bioconductor, or GitHub. A GitHub package name should be formatted as `pkg@path`. A typical usage of this target would be
`R_library` can also be used to check for specific versions of packages. For example:
R_library('edgeR', '3.12.0')
will result in a warning if the edgeR version is not 3.12.0. You can specify multiple versions,
R_library('edgeR', ['3.12.0', '3.12.1'])
a certain version or newer,
R_library('edgeR', '>=3.12.0')
or a version older than a given one,
R_library('ggplot2', '<1.0.0')
The default R library repo is http://cran.us.r-project.org. It is possible to customize the repo from which an R library would be installed, for example:
R_library('Rmosek', repos = "http://download.mosek.com/R/7")
To install from a GitHub repository:
R_library('varbvs@pcarbo/varbvs/varbvs-R')
where `varbvs` is the package name and `pcarbo/varbvs/varbvs-R` corresponds to the sub-directory `varbvs-R` in the repository https://github.com/pcarbo/varbvs.
The `Py_Module` target is usually used in the `depends` statement of a SoS step to specify a required Python module. For example,
If a module is not available, with `autoinstall=True` SoS will try to execute command `pip install` to install it, which might or might not succeed depending on your system configuration. For example,
Py_Module('numpy', autoinstall=True)
To specify a version,
Py_Module('numpy', version=">=1.14.0")
Or, using a shorthand syntax,
Py_Module('numpy>1.14.0')
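To illustrate what a version constraint such as `>=1.14.0` means, here is a minimal sketch in plain Python. It is not the logic SoS uses: `parse_version`, `satisfies`, and `module_available` are hypothetical helpers, only plain dotted numeric versions are handled, and a bare version is assumed to mean an exact match.

```python
import importlib.util
import operator
import re

# Comparison operators recognised by this sketch.
_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    ">": operator.gt, "<": operator.lt,
}

def parse_version(text):
    """Turn '1.14.0' into (1, 14, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))

def satisfies(installed, spec):
    """Check an installed version against a spec such as '>=1.14.0' or '1.14.0'."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)?\s*([\d.]+)\s*", spec)
    if match is None:
        raise ValueError(f"unsupported version spec: {spec!r}")
    have, want = parse_version(installed), parse_version(match.group(2))
    # Pad with zeros so that '1.14' and '1.14.0' compare as equal.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return _OPS[match.group(1) or "=="](have, want)

def module_available(name):
    """Return True if the named top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None
```

With these helpers, `satisfies("1.16.2", ">=1.14.0")` is true and `satisfies("1.13.0", ">1.14.0")` is false, while `module_available("json")` confirms that a module can be located before any version check is attempted.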