SoS Actions and common action options

Difficulty level: intermediate
Time needed to learn: 10 minutes or less
Key points:
- SoS actions are Python functions that usually start an interpreter to execute a script
- Parameters of actions allow you to execute actions with additional parameters, control input and output, and execute actions in containers

SoS Actions

Although arbitrary Python functions can be used in the process of a SoS step, SoS defines many special functions called actions that accept some shared parameters and can behave differently in different running modes of SoS.

For example, command sleep 5 would be executed in run mode,

[#] 1 step processed (1 job completed)

However, if the action is executed in dryrun mode (option -n), it only prints the script it would have executed.

[#] 1 step processed (1 job completed)

Action options

Actions can have their own parameters, but they all accept a common set of options that define how they interact with SoS.

Option `active`

Action option active is used to activate or deactivate an action. It accepts either a condition that evaluates to a boolean value (True or False), or one or more integers or slices that correspond to the indexes of active substeps.

The first usage allows you to execute an action only if certain condition is met, so

if cond:
  action(script)

is equivalent to

action(script, active=cond)

or

action: active=cond
  script

in script format. For example, the following action will only be executed if a.txt exists

       1       1      10 a.txt

For the second usage, when a loop is defined by the for_each or group_by option of the input: statement, an action after input is repeated for each substep. The active parameter then accepts a non-negative integer, a negative integer (counting backward from the last substep), a sequence of indexes, or a slice object, and the action is active only for the matching substeps.

For example, for an input loop that loops through a sequence of numbers, the first run action is executed for all substeps, the second is executed for even-numbered substeps, and the last is executed only for the last substep.

A at substep 0
B at substep 0
A at substep 1
A at substep 2
B at substep 2
A at substep 3
A at substep 4
B at substep 4
C at substep 4
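
The selection rule can be mimicked in plain Python. The sketch below is not SoS code: `is_active` and `run_substeps` are hypothetical helpers, and the sketch assumes that integers, sequences, and slices are interpreted the same way Python indexing interprets them, with `slice(None, None, 2)` standing in for "even-numbered substeps".

```python
def is_active(active, index, total):
    """Return True if a (hypothetical) action should run at substep `index`.

    `active` may be a bool, an int (negative counts from the end),
    a sequence of ints, or a slice, mirroring the option described above.
    """
    if isinstance(active, bool):      # test bool first: bool is a subclass of int
        return active
    if isinstance(active, int):
        return index == (active if active >= 0 else total + active)
    if isinstance(active, slice):
        return index in range(total)[active]
    # otherwise assume a sequence of indexes
    return any(is_active(item, index, total) for item in active)


def run_substeps(total):
    """Replay the example: A always, B on even substeps, C on the last one."""
    lines = []
    for index in range(total):
        for name, active in (("A", True), ("B", slice(None, None, 2)), ("C", -1)):
            if is_active(active, index, total):
                lines.append(f"{name} at substep {index}")
    return lines


print("\n".join(run_substeps(5)))
```

With five substeps this prints the same nine lines as the output above.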

Option `allow_error`

Option allow_error tells SoS that the action might fail but that a failure should not stop the workflow from executing. This option essentially turns an error into a warning message and changes the return value of the action to None.

In the following example, the invalid shell script stops the execution of the step, so the action after it is not executed.

This is not shell
/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmps3bsqzt5.sh: line 1: This: command not found

ExecuteError: [0]: 
Failed to execute ``/bin/bash -ev .sos/scratch_0_0_adafd66b.sh``
exitcode=127, workdir=``/Users/bpeng1/sos/sos-docs/src/user_guide``
---------------------------------------------------------------------------

With option allow_error=True, the error from the sh action is turned into a warning and the rest of the step continues to execute:

/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp49o8mjw8.sh: line 1: The: command not found
Step after run
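
The effect of allow_error can be pictured with a small plain-Python wrapper. This is only a sketch of the behavior described in this section, not how SoS implements it; `run_action` is a hypothetical name, and the failing command is written with the current Python interpreter so that the example does not depend on a particular shell.

```python
import subprocess
import sys
import warnings


def run_action(command, allow_error=False):
    """Run `command` (a list of arguments) and return its exit code.

    With allow_error=True a failure is reported as a warning and None is
    returned, so the caller can carry on with the rest of the step.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        message = f"Failed to execute {command!r} (exitcode={result.returncode})"
        if allow_error:
            warnings.warn(message)
            return None
        raise RuntimeError(message)
    return result.returncode


# A command that always fails with exit code 127.
failing = [sys.executable, "-c", "import sys; sys.exit(127)"]

print(run_action(failing, allow_error=True))  # a warning is issued, None is returned
print("Step after run")
```

Without allow_error=True the same call raises RuntimeError, and the final print is never reached.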

Option `args`

All script-executing actions accept an option args, which changes how the script is executed.

By default, such an action has an interpreter (e.g. bash), a default args='{filename:q}', and the script would be executed as interpreter args, which is

bash {filename:q}

where {filename:q} would be replaced by the script file created from the body of the action.

If you would like to change the command line with additional parameters, or with a different format of filename, you can specify an alternative args that uses variables filename (filename of the temporary script) and script (actual content of the script).

For example, you can pass command line options to a bash script using args as follows

ARG1 ARG2

and you can even execute a command without filename by passing the script directly on the command line

10000 loops, best of 5: 22.5 usec per loop
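
The expansion of args described in this section can be sketched in plain Python. The code is an illustration under assumptions, not SoS code: `Quotable` and `build_command` are hypothetical names, and the `q` format specification is emulated with `shlex.quote`.

```python
import shlex


class Quotable(str):
    """A string that understands a `q` format specification (shell quoting)."""

    def __format__(self, spec):
        if spec == "q":
            return shlex.quote(str(self))
        return format(str(self), spec)


def build_command(interpreter, filename, script, args="{filename:q}"):
    """Expand `args` and prepend the interpreter, as in `interpreter args`."""
    expanded = args.format(filename=Quotable(filename), script=Quotable(script))
    return f"{interpreter} {expanded}".strip()


# Default: the interpreter followed by the quoted script file.
print(build_command("bash", "/tmp/my script.sh", "echo $1 $2"))
# Extra command line arguments after the script file.
print(build_command("bash", "a.sh", "echo $1 $2", args="{filename:q} ARG1 ARG2"))
# No script file at all: the script content is passed on the command line.
print(build_command("python", "a.py", "x = 1", args="-m timeit {script:q}"))
```

The three calls correspond to the default command line, the example with ARG1 ARG2, and the example that passes the script itself instead of a file name.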

Options `container` and `engine`

Parameters container and engine specify the name or URL of the container and the execution engine used to execute the action. Parameter engine is usually derived from container but can be specified explicitly as one of

engine='docker': Execute the script in specified container using docker
engine='singularity': Execute the script with singularity
engine='local': Execute the script locally. This is the default mode.

Parameters container and engine accept the following values:

`container`	`engine`	execute by	example	comment
`tag`		docker	`container='ubuntu'`	docker is the default container engine
`name`	`docker`	docker	`container='ubuntu', engine='docker'`	treat `name` as docker tag
`docker://tag`		docker	`container='docker://ubuntu'`
`filename.simg`		singularity	`container='ubuntu.simg'`
`shub://tag`		singularity	`container='shub://GodloveD/lolcow'`	Image will be pulled to a local image
`library://tag`		singularity	`container='library://GodloveD/lolcow'`	Image will be pulled to a local image
`name`	`singularity`	singularity	`container='a_dir', engine='singularity'`	treat `name` as singularity image file or directory
`docker://tag`	`singularity`	singularity	`container='docker://godlovdc/lolcow', engine='singularity'`
`file://filename`		singularity	`container='file://ubuntu.simg'`
`local://name`		local	`container='local://any_tag'`	`local://any_tag` is equivalent to `engine='local'`
`name`	`local`	local	`engine=engine` with `parameter: engine='docker'`	Usually used to override parameter `container`

Basically,

container='tag' pulls and uses docker image tag
container='filename.simg' uses an existing singularity image
container='shub://tag' pulls and uses singularity image shub://tag, which will generate a local tag.simg file
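
The rules in the table can be summarized as a small decision function. The following is a sketch that reproduces the rows of the table only; `resolve_engine` is a hypothetical name, and the choice of `local` when neither container nor engine is given is an assumption based on the statement that local execution is the default mode.

```python
ENGINES = ("docker", "singularity", "local")


def resolve_engine(container=None, engine=None):
    """Return the engine ('docker', 'singularity' or 'local') for an action."""
    if engine is not None:
        if engine not in ENGINES:
            raise ValueError(f"Unknown engine {engine!r}")
        return engine                      # an explicit engine always wins
    if not container or container.startswith("local://"):
        return "local"
    if container.startswith(("shub://", "library://", "file://")):
        return "singularity"
    if container.startswith("docker://"):
        return "docker"
    if container.endswith(".simg"):
        return "singularity"
    return "docker"                        # a plain tag defaults to docker


for container, engine in [
    ("ubuntu", None),
    ("ubuntu.simg", None),
    ("shub://GodloveD/lolcow", None),
    ("a_dir", "singularity"),
    ("ubuntu", "local"),
]:
    print(container, engine, "->", resolve_engine(container, engine))
```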

If a docker image is specified, the action is assumed to be executed in the specified docker container. The image will be automatically downloaded (pulled) if it is not available locally.

For example, executing the following script

[10]
python3: container='python'
  set = {'a', 'b'}
  print(set)

under a docker terminal (that is connected to the docker daemon) will

Pull docker image python, which is the official docker image for Python 2 and 3.
Create a python script with the specified content
Run the docker container python and make the script available inside the container
Use the python3 command inside the container to execute the script.

Additional docker_run parameters can be passed to actions when the action is executed in a docker image. These options include

name: name of the container (option --name)
tty: if a tty is attached (default to True, option -t)
stdin_open: if stdin should be open (default to False, option -i)
user: username (default to root, option -u)
environment: Can be a string, a list of strings, or a dictionary of environment variables for docker (option -e)
volumes: shared volumes as a string or list of strings, in the format of hostdir (for hostdir:hostdir) or hostdir:mnt_dir, in addition to current working directory which will always be shared.
volumes_from: container names or Ids to get volumes from
port: port opened (option -p)
extra_args: any extra arguments you would like to pass to the docker run process (after you have checked the actual docker run command issued by SoS)

Because of the different configurations of docker images, use of docker in SoS can be complicated. Please refer to http://vatlab.github.io/doc/user_guide/docker.html for details.

Option `default_env`

Option default_env sets environment variables if they do not exist in the system. The value of this option should be a dictionary with string keys and values.

For example, if we have a process that depends on an environmental variable DEBUG, you can set a default value for it

Working in DEBUG mode

If users have already set DEBUG to something else, the option is not applied and the shell script runs in production mode.

Option `env`

Option env sets environment variables that override system variables defined in os.environ. This option can be used to define PATH and other environment variables for the action. Note that the effect of this option is limited to the action it is specified for.

Working in DEBUG mode
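
The difference between default_env and env comes down to how each is merged with the existing environment. A minimal sketch, assuming the semantics described in these two sections (`build_env` is a hypothetical helper, not part of SoS):

```python
def build_env(system_env, default_env=None, env=None):
    """Return the environment an action would see.

    `default_env` only fills in variables that are missing, whereas `env`
    always overrides. `system_env` itself is left untouched, so the change
    is limited to the action that uses the returned dictionary.
    """
    result = dict(system_env)
    for key, value in (default_env or {}).items():
        result.setdefault(key, value)
    result.update(env or {})
    return result


system = {"PATH": "/usr/bin", "DEBUG": "OFF"}
print(build_env(system, default_env={"DEBUG": "ON"})["DEBUG"])  # existing value is kept
print(build_env(system, env={"DEBUG": "ON"})["DEBUG"])          # value is overridden
print(system["DEBUG"])                                          # original is unchanged
```

The resulting dictionary could then be handed to something like `subprocess.run(..., env=...)`, while `os.environ` stays as it was.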

Option `input`

Although all actions accept parameter input, its usage varies among actions. Roughly speaking, script-executing actions such as run, bash and python prepend the content of all input files to the script; report-generation actions report, pandoc and RMarkdown append the content of input files after the specified script; other actions usually ignore this parameter.

For example, if you have defined a few utility functions that will be used by multiple scripts, you can define them in a separate file

and include it in python actions as follows:

Hello

Note that although SoS would check the existence of input files before executing the action, this option does not define any variable (such as _input) to be used in the script.
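
The prepending behavior of script-executing actions can be pictured as plain text concatenation. The sketch below only illustrates that idea; `prepend_inputs` and the file name `myfunc.py` are hypothetical, and SoS may assemble the final script differently.

```python
import tempfile
from pathlib import Path


def prepend_inputs(script, input_files):
    """Return `script` with the content of every input file placed before it."""
    parts = []
    for name in input_files:
        path = Path(name)
        if not path.is_file():                  # inputs must exist beforehand
            raise FileNotFoundError(f"Input file {name} does not exist")
        text = path.read_text()
        parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts) + script


with tempfile.TemporaryDirectory() as tmp:
    utils = Path(tmp) / "myfunc.py"             # hypothetical utility file
    utils.write_text("def greet():\n    print('Hello')\n")
    combined = prepend_inputs("greet()\n", [utils])

exec(combined)  # runs the utility definitions followed by the script
```

Because the input is merged as text, the script can call `greet()` directly, but no variable describing the input files is made available to it.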

Option `output`

Similar to input, parameter output defines the output of an action, which can be a single name (or target) or a list of files or targets. SoS checks the existence of the output targets after the completion of the action. For example,

ERROR: [10]: [0]: 
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
script_5033863050000806077 in <module>
----> bash('\n', output='non_existing.txt')
      

RuntimeError: Output target non_existing.txt does not exist after completion of action bash

RuntimeError: Workflow exited with code 1

Option `stdout`

Option stdout is applicable to script-executing actions such as bash and R and redirects the standard output of the action to the specified file. The value of the option should be a path-like object (str, path, etc.) or False. The file is opened in append mode, so you have to remove or truncate an existing file if you do not want new output added to its end. If stdout=False, the output is suppressed (redirected to /dev/null under Linux).

For example,

auxiliary_steps.ipynb
cli.ipynb
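
The append-mode behavior is easy to reproduce with the standard library. The following sketch uses `subprocess` rather than SoS; `run_with_stdout` is a hypothetical helper, and the command is a one-line Python program so that the example does not depend on a shell.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_stdout(command, stdout=None):
    """Run `command`; append its standard output to `stdout`, or drop it if False."""
    if stdout is False:
        return subprocess.run(command, stdout=subprocess.DEVNULL).returncode
    if stdout is None:
        return subprocess.run(command).returncode
    with open(stdout, "a") as handle:           # append mode, as described above
        return subprocess.run(command, stdout=handle).returncode


hello = [sys.executable, "-c", "print('hello')"]

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "output.txt"
    run_with_stdout(hello, stdout=log)
    run_with_stdout(hello, stdout=log)          # the second run appends to the file
    run_with_stdout(hello, stdout=False)        # this output is discarded
    print(log.read_text().split())
```

After the two redirected runs the file contains the output of both, which is why an existing file has to be removed or truncated first when only the latest output is wanted.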

Option `stderr`

Option stderr is similar to stdout but redirects the standard error output of actions. stderr=False also suppresses stderr.

Options `template` and `template_name`

Actions are by default executed directly with their interpreters, for example an R action will trigger a command Rscript script_name where script_name is a temporary file with the content of the script.

You could execute the command in a template that is specified either directly with option template, or a name with option template_name.

Expansion of template

When a template is specified directly, it should be a string with the following variables that will be expanded before execution:

variable	value
`cmd`	the command being executed (e.g. `Rscript script_name`)
`filename`	the script file (e.g. `script_name`) with type `sos_targets`
`script`	the script that is being executed
variable	any keyword argument

For example, with a template cat {filename}, the action prints the content of the script instead of executing it.

echo Hello

In another example, a template is used to calculate the time used to execute the shell script.

It took 5 seconds

Pre-defined templates

If option template_name is specified, SoS will look into configuration files for a dictionary named action_templates for the template, and then for default templates provided by SoS.

For example, if we save templates show_script and time_me in a configuration file myconfig.yml

These templates can be used directly with option template_name:

It took 5 seconds
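
Template lookup and expansion can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `find_template` and `expand_template` are hypothetical helpers, the configuration is a plain dictionary rather than a parsed configuration file, the content of the `time_me` template is invented for this sketch, and expansion is done with `str.format`.

```python
# Stand-in for the templates shipped with SoS (left empty in this sketch).
DEFAULT_TEMPLATES = {}


def find_template(template_name, config):
    """Look in config['action_templates'] first, then in the default templates."""
    templates = config.get("action_templates", {})
    if template_name in templates:
        return templates[template_name]
    if template_name in DEFAULT_TEMPLATES:
        return DEFAULT_TEMPLATES[template_name]
    raise ValueError(f"No template named {template_name!r}")


def expand_template(template, cmd, filename, script, **kwargs):
    """Fill in cmd, filename, script and any extra keyword arguments."""
    return template.format(cmd=cmd, filename=filename, script=script, **kwargs)


config = {
    "action_templates": {
        "show_script": "cat {filename}",
        "time_me": 'start=$(date +%s)\n{cmd}\n'
                   'echo "It took $(( $(date +%s) - start )) seconds"',
    }
}

for name in ("show_script", "time_me"):
    template = find_template(name, config)
    print(expand_template(template, cmd="bash script.sh",
                          filename="script.sh", script="sleep 5"))
    print("---")
```

The expanded text is what would be executed in place of the plain command, so `show_script` prints the script file while `time_me` wraps the original command with timing code.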

Built-in templates

Currently the following templates are provided

template_name	option	comment
`conda`	`env_name`	execute script in specified conda environment

To use built-in template conda, you will need to provide option env_name as a keyword argument as follows

RUNNING IN sos

Non-shell templates

Templates are by default shell scripts (batch scripts under Windows) and are executed as such. However, an arbitrary interpreter can be specified with a shebang line in the template. For example, the following template wraps the Python script directly to print its execution time. Note that the braces that are not interpolated by SoS are doubled in the Python f-string.

It takes 2.0s to execute

Option `tracked`

If an action takes a long time to execute and the step it resides in tends to change (for example, during the development of a workflow step), you might want to keep action-level signatures so that the action can be skipped if it has been executed before.

Action-level signature is controlled by parameter tracked, which can be None (no signature), True (record signature), False (do not record signature), a string (filename), or a list of filenames. When this parameter is True or one or more filenames, SoS will

if specified, collect targets specified by parameter input
if specified, collect targets specified by parameter output
if one or more files are specified, collect targets from parameter tracked

These files, together with the content of the first parameter (usually a script), are used to create a signature, which allows actions with the same signature to be skipped.

For example, suppose sh is a time-consuming action that produces output test.txt

1577299883.543726

Because of the tracked=True parameter, a signature is created from the output, and the action is not re-executed even when the step itself is changed (from sleep(2) to sleep(1)).

1577299883.543726

Note that the signature can only be saved and used with appropriate signature mode (force, default etc).
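
The idea behind action-level signatures can be illustrated with a content hash. This is a conceptual sketch only: `files_digest` and `run_tracked` are hypothetical helpers, an in-memory dictionary stands in for wherever SoS keeps its signatures, and the hashing scheme is an assumption rather than the format SoS uses.

```python
import hashlib
import tempfile
import time
from pathlib import Path


def files_digest(paths):
    """Hash the names and contents of `paths` into one hex string."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(path.encode())
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def run_tracked(store, script, action, inputs=(), outputs=()):
    """Run `action` unless an identical run is already recorded in `store`.

    Returns True if the action was executed and False if it was skipped.
    """
    key = hashlib.sha256((script + files_digest(inputs)).encode()).hexdigest()
    outputs_exist = all(Path(p).exists() for p in outputs)
    if outputs_exist and store.get(key) == files_digest(outputs):
        return False
    action()
    store[key] = files_digest(outputs)
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "test.txt"
    store = {}
    script = "write the current time to test.txt"   # stands in for the script text

    def write_time():
        out.write_text(f"{time.time()}\n")

    print(run_tracked(store, script, write_time, outputs=[out]))  # executed
    first = out.read_text()
    print(run_tracked(store, script, write_time, outputs=[out]))  # skipped
    print(out.read_text() == first)                               # file is unchanged
```

Only the script text and the tracked files enter the signature, so changes elsewhere in the surrounding step do not trigger a re-run.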

Option `workdir`

Option workdir changes the current working directory for the action, and changes it back once the action has been executed. The directory will be created if it does not exist.

a.txt
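
The same create, enter, and restore pattern can be written as a context manager in plain Python. This is a sketch of the behavior described above, not SoS code; `working_directory` is a hypothetical name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Create `path` if needed, change into it, and change back afterwards."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)      # restored even if the body raises an error


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "result_dir")    # does not exist yet
    with working_directory(target):
        with open("a.txt", "w") as handle:      # created inside result_dir
            handle.write("content\n")
    print(os.listdir(target))
```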
