- Difficulty level: difficult
- Time needed to learn: 30 minutes or more
- Key points:
  - `~/.sos/hosts.yml` is essential for the use of many features of SoS
  - `task_template` is used for the submission of tasks to remote hosts
  - `workflow_template` is used for the execution of workflows on remote hosts
  - `process` (default) and `pbs` engines are supported
SoS uses configuration files to specify hosts and their features. Please make sure you understand the basic syntax of SoS configuration files before you continue.
Instead of using a dedicated server-side daemon process, SoS executes tasks or workflows on remote hosts through `ssh`. To use any remote host for the execution of SoS workflows or tasks:

- The local host needs to have `ssh`-related tools such as `ssh`, `scp`, and `rsync` installed.
- The remote hosts need to be accessible through public-key authentication, which can be set up manually or with command `sos remote setup`. The key can be the default (e.g. `~/.ssh/id_rsa`) or specified as an identity file.
- The remote hosts need to have `sos` installed and have it in their `$PATH`.
- Except for very simple cases, the remote hosts should be defined in SoS config files, preferably in `site_config.yml` by your system admin or `~/.sos/hosts.yml` by you, so that SoS knows how to work with them.
SoS copies tasks or workflows to remote hosts and executes them under specified remote directories. That is to say, if the local home directory is `/home/bpeng1`, the remote home directory is `/Users/bpeng1`, and a workflow is submitted from `/home/bpeng1/Projects`, it will be copied to and executed under directory `/Users/bpeng1/Projects` (if so configured) on the remote host. This implies that
- A complete host definition consists of definitions of a localhost and a remote host so that SoS knows how to map directories.
- SoS will create directories and files under specified directories on remote host, and it is your responsibility to remove them.
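To illustrate the mapping described above, here is a minimal sketch of a pair of host definitions; the aliases and addresses are hypothetical placeholders, not part of the original example:

```yaml
# Hypothetical hosts illustrating directory mapping; addresses are placeholders.
# A workflow submitted from /home/bpeng1/Projects on linux_box would be
# copied to and executed under /Users/bpeng1/Projects on mac_box.
hosts:
  linux_box:
    address: linux.example.com
    paths:
      home: /home/bpeng1
  mac_box:
    address: mac.example.com
    paths:
      home: /Users/bpeng1
```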
SoS uses templates to execute tasks or workflows on remote hosts, which can be batch systems such as LSF, Slurm, and Torque. When dealing with a batch system, SoS generates a shell script from a host-specific template with specified parameters, sends it to the remote host, and submits it. This is why host configuration is essential for the use of batch systems.
For your convenience, here is a summary of all possible properties of a host definition:

property | usage
---|---
`description` | description of the host
`address` | hostname or IP address; `username@host` is allowed
`port` | `ssh` port, if different from 22
`hostname` | optional hostname used to identify localhost
`pem_file` | identity file for one or more remote hosts
`paths` | paths that could be mapped between local and remote hosts
`shared` | shared file systems
`queue_type` | how tasks and workflows are managed; can be `process` (default), `pbs`, or `rq` (experimental)
`max_running_jobs` | max number of running jobs on the queue
`workflow_template` | template for the execution of workflows
`task_template` | template for the execution of tasks
`sos` | path to `sos` on the remote host if not in `$PATH`
`VARIABLE` (any) | variable used to interpolate other properties
Properties for PBS `queue_type`:

property | usage
---|---
`submit_cmd` | (PBS only) command to submit a job to the PBS system
`submit_cmd_output` | (PBS only) output of `submit_cmd`, used to extract `job_id`
`status_cmd` | (PBS only) command to query job status
`kill_cmd` | (PBS only) command to kill a PBS job
Variables passed to `workflow_template`:

variable | usage
---|---
`filename` | filename of the script to be executed
`script` | content of the SoS workflow
`job_name` | a unique ID derived from the content of the workflow
`command` | command to be executed by SoS (`sos run ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Variables passed to `task_template`:

variable | usage
---|---
`task` | task ID
`nodes` | number of nodes for a single task
`cores` | number of cores for a single task
`mem` | total RAM of a single task, passed to the template in bytes
`walltime` | total execution time of a task, passed to the template in the format `HH:MM:SS`
`workdir` | current project directory (mapped to remote host)
`command` | command to be executed by SoS (`sos execute ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Although not recommended, you can use the hostname or IP address of a remote host directly if you have set up public-key access to the host so that you do not have to enter a password to log in.

For example, with option `-r bcbm-bpeng.mdanderson.edu`, the following example executes a shell script on remote host `bcbm-bpeng.mdanderson.edu`.

The workflow is executed under the "same" directory on `bcbm-bpeng.mdanderson.edu`. If you actually check the remote host, you will find a temporary `.sos` file under that directory. However, the example works only because the remote host has directory `/Users/bpeng1/sos/sos-docs/src/user_guide`. The workflow would fail on, for example, a Linux system with only `/home/bpeng1`, in which case host configurations are needed.
SoS hosts should be defined under key `hosts` of an SoS configuration file, usually in `~/.sos/hosts.yml`. A basic host definition specifies an `alias`, an `address`, and a `hostname`, and lets SoS know the `paths` that could be matched with `paths` of another host.
A simple definition for `bcbm-bpeng.mdanderson.edu` would be
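The original configuration is not reproduced here; a minimal sketch, assuming only an alias and an address, would be:

```yaml
hosts:
  bcbm:
    address: bcbm-bpeng.mdanderson.edu
```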
With this definition, you can use bcbm
as an alias to bcbm-bpeng.mdanderson.edu
as follows
Again, the workflow works because the remote host has the same volumes as the local host. To properly set up a remote host, we need to define `paths` on the local and remote hosts so that SoS knows how to map the current working directory.
The following configuration file defines two hosts, one `mac_pro` and one `bcbm`, and sets `localhost` to `mac_pro`. The `localhost` key tells SoS which host you are working on. It can be omitted if the localhost can be identified by the `hostname` or `address` of a host, but I have to specify it explicitly here because `mac_pro` does not have a fixed hostname or IP address.
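A sketch of such a configuration, reconstructed from the description above (the address of `bcbm` and the exact local path value are assumptions):

```yaml
# Sketch: localhost set explicitly because mac_pro has no fixed address.
localhost: mac_pro
hosts:
  mac_pro:
    paths:
      home: /Users/bpeng1
  bcbm:
    address: bcbm-bpeng.mdanderson.edu   # assumed address
    paths:
      home: /Users/bpeng1/scratch
```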
One or more paths can be defined under `paths` for each host, and SoS will try to map paths between the local and remote hosts. For example, since we specify `home` of `bcbm` to be `/Users/bpeng1/scratch`, a local directory `/Users/bpeng1/sos/sos-docs/src/user_guide` would be mapped to `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide` on `bcbm`.

Now, when we execute the workflow on the remote host, it would actually be executed under `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide`. It is a common technique to use a dedicated directory on the remote host for SoS workflows to avoid overwriting useful files under `/Users/bpeng1`.
### `sos` command on remote host

The command-line tool `sos` needs to be installed on the remote host to execute workflows or tasks there.
Since SoS uses a login shell to talk to remote hosts, it is generally good enough to have `sos` in `$PATH`. However, if you do not want to add the path to the `sos` executable to your `$PATH` on the remote host, you can define the full path to `sos` in the host definition.
For example:

hosts:
  cluster:
    address: my_cluster
    sos: /share/app/python3.7/bin/sos
Option `paths` specifies common directories on different file systems. If the local and remote hosts share certain file systems, you can list them under `shared` so that SoS will not attempt to copy files. For example, if two disk volumes are mounted on both `worker` and `server` under different directories, you can list them as
hosts:
worker:
shared:
scratch: /mnt/scratch
data: /mnt/data
server:
shared:
scratch: /scratch/
data: /shared/data
In this way, if you are working under a directory `/mnt/scratch/project`, SoS knows that the workflow will be available under `/scratch/project` on `server` and executes it directly there without copying the workflow from `worker` to `server`.
A basic setup for public-key authentication has a local private key (usually `~/.ssh/id_rsa`) and a public key that is listed in `~/.ssh/authorized_keys` on the remote host. An identity file essentially allows you to use an alternative file to store the private key, or different private keys for different remote hosts. A typical case in which a `pem` file is given is connecting to AWS EC2 instances, where `pem` files are generated by AWS for you to access the instances.

To allow the use of an identity file to connect to a remote host, you can define `pem_file` for the remote host. For example, you can have the following definition for an AWS EC2 instance
hosts:
aws:
address: ec2-user@xx.xx.xx.xx
pem_file: /path/to/my.pem
paths:
home: /home/ec2-user/
If you have multiple EC2 instances sharing the same identity file, you can define the `pem_file` in the definition of the localhost, as
localhost: desktop
hosts:
desktop:
pem_file: /path/to/my.pem
aws1:
address: ec2-user@xx.xx.xx.xx
aws2:
address: ec2-user@xx.xx.xx.xx
If you have different identity files for different remote hosts, you can specify them as a dictionary:
localhost: desktop
hosts:
desktop:
pem_file:
aws1: /path/to/ec1.pem
aws2: /path/to/ec2.pem
aws1:
address: ec2-user@xx.xx.xx.xx
aws2:
address: ec2-user@xx.xx.xx.xx
non_aws:
address: another_host
Note that SoS configurations can be split into multiple configuration files, so you can define hosts in `site_config.yml` or `~/.sos/hosts.yml`, and the locations of identity files in a separate configuration file.
Each host has a queue that specifies how tasks and workflows are executed. The default `queue_type` is `process`, meaning that the task or workflow will be executed directly, subject to `max_running_jobs`, which defaults to 10 for the `process` queue.
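As a sketch, a host using the default `process` queue needs no template at all; the alias and address below are placeholders:

```yaml
hosts:
  server:
    address: server.example.com
    queue_type: process    # default; may be omitted
    max_running_jobs: 5    # override the default of 10
```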
For a PBS-style batch system such as LSF or Slurm, SoS creates a shell script from a `workflow_template` or `task_template`, submits it to the batch system, and monitors its status.
The first property, `submit_cmd`, is the command that will be executed to submit the job. It accepts all variables for `task_template`, plus a variable `job_file` that points to the location of the job file on the remote host. The `submit_cmd` is usually as simple as

qsub {job_file}

but could contain other variables such as `walltime`

msub -l {walltime} < {job_file}
After the task is submitted, SoS tries to capture a `job_id` from the output of `submit_cmd`. The output differs from system to system, so `submit_cmd_output` could be as simple as
submit_cmd_output='{job_id}'
or something like
submit_cmd_output='Job <{job_id}> is submitted to queue <{queue}>'
Although currently unused, `status_cmd` and `kill_cmd` should be commands to query the status of, or kill, the PBS job with `job_id`. For example, for a basic Torque system, these properties could be
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
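Putting these PBS-related properties together, a minimal Torque-style entry could look like the following sketch (the alias and address are placeholders):

```yaml
hosts:
  torque:
    address: host.url
    queue_type: pbs
    submit_cmd: qsub {job_file}       # submits the generated job file
    submit_cmd_output: '{job_id}'     # qsub prints the job id directly
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}
```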
Variables passed to `task_template`:

variable | usage
---|---
`task` | task ID
`nodes` | number of nodes for a single task
`cores` | number of cores for a single task
`mem` | total RAM of a single task, passed to the template in bytes
`walltime` | total execution time of a task, passed to the template in the format `HH:MM:SS`
`workdir` | current project directory (mapped to remote host)
`command` | command to be executed by SoS (`sos execute ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Variables `mem`, `walltime`, etc. are defined from task options

task: walltime='2h'

or from the command line

%sos run -q cluster walltime=2h

to specify the resources needed for one task. The input values will be adjusted if multiple tasks are grouped together (with options `trunk_size` and `trunk_workers`). SoS recognizes the units of the input and converts it to the standard `HH:MM:SS` format before passing it to the template.
The SoS task executor treats `~/.sos/tasks/{task}.out` and `~/.sos/tasks/{task}.err` as the `stdout` and `stderr` of the PBS system and depends on these files to report errors from the PBS system. It is therefore required to specify these two files as the standard output and error output of the cluster job.
cluster:
address: host.url
description: cluster with PBS
paths:
home: /scratch/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#PBS -N {task}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
#PBS -l mem={mem//10**9}GB
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
#PBS -e /home/{user_name}/.sos/tasks/{task}.err
#PBS -m ae
#PBS -M email@address
#PBS -v {workdir}
{command}
max_running_jobs: 100
submit_cmd: qsub {job_file}
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
cluster:
address: host.url
description: cluster with MOAB
paths:
home: /scratch/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#PBS -N {task}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
#PBS -l mem={mem//10**9}GB
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
#PBS -e /home/{user_name}/.sos/tasks/{task}.err
#PBS -m ae
#PBS -M email@address
#PBS -v {workdir}
{command}
max_running_jobs: 100
submit_cmd: msub {job_file}
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
slurm:
description: cluster with SLURM
address: host.url
paths:
home: /home/{user_name}
queue_type: pbs
status_check_interval: 120
max_running_jobs: 15
max_cores: 28
max_walltime: "36:00:00"
max_mem: 256G
job_template: |
#!/bin/bash
#SBATCH --time={walltime}
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={cores}
#SBATCH --mem-per-cpu={mem_per_cpu}
#SBATCH --job-name={task}
#SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
#SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
cd {workdir}
{command}
walltime: "06:00:00"
cores: 20
mem_per_cpu: 1000
submit_cmd: sbatch {job_file}
submit_cmd_output: "Submitted batch job {job_id}"
status_cmd: squeue --job {job_id}
kill_cmd: scancel {job_id}
lsf:
address: host.url
description: cluster with LSF
paths:
home: /rsrch2/bcb/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#BSUB -J {task}
#BSUB -q {'short' if int(walltime.split(':')[0]) < 24 else 'long'}
#BSUB -n {cores}
#BSUB -M {mem//10**9}G
#BSUB -W 1:0
#BSUB -o /home/{user_name}/.sos/tasks/{task}.out
#BSUB -e /home/{user_name}/.sos/tasks/{task}.err
#BSUB -N
#BSUB -u email@address
cd {workdir}
{command}
max_running_jobs: 100
submit_cmd: bsub < {job_file}
submit_cmd_output: 'Job <{job_id}> is submitted to queue <{queue}>'
status_cmd: bjobs {job_id}
kill_cmd: bkill {job_id}
Task Spooler is a light-weight task spooler for single machines.
taskspooler:
description: task spooler on a single machine
address: {user_name}@host.url
port: 32771
paths:
home: /home/{user_name}
queue_type: pbs
status_check_interval: 5
task_template: |
#!/bin/bash
cd {workdir}
{command}
max_running_jobs: 100
submit_cmd: tsp -L {task} sh {job_file}
status_cmd: tsp -s {job_id}
kill_cmd: tsp -r {job_id}
`workflow_template` defines how to execute a workflow on the host.
variable | usage
---|---
`filename` | filename of the script to be executed
`script` | content of the SoS workflow
`job_name` | a unique ID derived from the content of the workflow
`command` | command to be executed by SoS (`sos run ...`)
`VARIABLE` (any) | variables passed from host definition or command line
A `workflow_template` can be very similar or even identical to a `task_template`. However, in contrast to `task_template`, where `walltime`, `mem`, etc. are converted and adjusted by SoS, these variables have to be fixed in the template or passed in string format to `workflow_template`, because variables for `workflow_template` can only be passed from the command line, such as
%run workflow -r host walltime=01:00:00
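For example, here is a sketch of a host whose `workflow_template` uses `walltime` verbatim (the alias and address are placeholders); since SoS does not convert variables passed to `workflow_template`, the value supplied on the command line must already be in the format the scheduler expects:

```yaml
hosts:
  hpc:
    address: host.url
    workflow_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      {command}
```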
Template parameters can be used to increase the flexibility of templates. For example, you can specify the use of a certain version of R for the execution of workflows using the following template
hpc_server:
address: ....
paths: ...
hosts:
hpc:
based_on: hpc_server
R_version: 3.3.1
workflow_template: |
module load R/{R_version}
{command}
and execute your workflow as follows:
sos run script -r hpc R_version=3.4.4
This method works, but it requires you to specify the version of R each time, which can be hard to remember. You can make it easier by setting a default version as follows:
hosts:
hpc:
based_on: hpc_server
R_version: 3.4.4
workflow_template: |
module load R/{R_version}
{command}
In this way, variable `R_version` will be used for `workflow_template`, but will be overridden by an `R_version` specified on the command line.
If you would like to specify different versions in the template, you can define multiple hosts as follows:
hosts:
hpc_r3.4.4:
based_on: hpc_server
R_version: 3.4.4
workflow_template: |
module load R/{R_version}
{command}
hpc_r3.6.0:
based_on: hosts.hpc_r3.4.4
R_version: 3.6.0
hpc_sklearn:
based_on: hosts.hpc_r3.6.0
workflow_template: |
module load R/{R_version}
module load sklearn
{command}
and use these environments with commands
sos run script -r hpc_r3.6.0
These templates make use of the facts that

- `based_on` copies the specified entry
- New definitions override contents from `based_on` items
- Templates are expanded using variables defined in the same dictionary (e.g. `R_version`)