- Difficulty level: difficult
- Time needed to learn: 30 minutes or more
- Key points:
  - `~/.sos/hosts.yml` is essential for the use of many features of SoS
  - `task_template` is used for the submission of tasks to remote hosts
  - `workflow_template` is used for the execution of workflows on remote hosts
  - `process` (default) and `pbs` engines are supported
SoS uses configuration files to specify hosts and their features. Please make sure you understand the basic syntax of SoS configuration files before you continue.
Instead of using a dedicated server-side daemon process, SoS executes tasks or workflows on remote hosts through `ssh`. To use any remote host for the execution of SoS workflows or tasks:
- The local host needs to have `ssh`-related tools such as `ssh`, `scp`, and `rsync` installed.
- The remote hosts need to be accessible through public-key authentication, which can be set up manually or with command `sos remote setup`. The key can be the default (e.g. `~/.ssh/id_rsa`) or be specified as an identity file.
- The remote hosts need to have `sos` installed and have it in their `$PATH`.
- Except for very simple cases, the remote hosts should be defined in SoS config files, preferably in `site_config.yml` by your system admin, or `~/.sos/hosts.yml` by you, so that SoS knows how to work with them.
SoS copies tasks or workflows to remote hosts and executes them under specified remote directories. That is to say, if the local home directory is `/home/bpeng1`, the remote home directory is `/Users/bpeng1`, and a workflow is submitted from `/home/bpeng1/Projects`, it will be copied to and executed under directory `/Users/bpeng1/Projects` (if so configured) on the remote host. This implies that
- A complete host definition consists of the definitions of a local host and a remote host, so that SoS knows how to map directories between them.
- SoS will create directories and files under the specified directories on the remote host, and it is your responsibility to remove them.
SoS uses templates to execute tasks or workflows on remote hosts, which can be batch systems such as LSF, Slurm, and Torque. When dealing with a batch system, SoS generates a shell script from a host-specific template with the specified parameters, sends it to the remote host, and submits it. This is why host configuration is essential for the use of batch systems.
For your convenience, here is a summary of all possible properties of a host definition:

| property | usage |
|---|---|
| `description` | description of the host |
| `address` | hostname or IP address; `username@host` is allowed |
| `port` | ssh port, if different from 22 |
| `hostname` | optional hostname used to identify the localhost |
| `pem_file` | identity file for one or more remote hosts |
| `paths` | paths that could be mapped between local and remote hosts |
| `shared` | shared file systems |
| `queue_type` | how tasks and workflows are managed; can be `process` (default), `pbs`, or `rq` (experimental) |
| `max_running_jobs` | maximum number of running jobs on the queue |
| `workflow_template` | template for the execution of workflows |
| `task_template` | template for the execution of tasks |
| `sos` | path to `sos` on the remote host, if not in `$PATH` |
| `VARIABLE` (any) | variable used to interpolate other properties |
Properties for PBS `queue_type`:

| property | usage |
|---|---|
| `submit_cmd` | (PBS only) command to submit a job to the PBS system |
| `submit_cmd_output` | (PBS only) output of `submit_cmd`, used to extract `job_id` |
| `status_cmd` | (PBS only) command to query job status |
| `kill_cmd` | (PBS only) command to kill a PBS job |
Variables passed to `workflow_template`:

| variable | usage |
|---|---|
| `filename` | filename of the script to be executed |
| `script` | content of the SoS workflow |
| `job_name` | a unique ID derived from the content of the workflow |
| `command` | command to be executed by SoS (`sos run ...`) |
| `VARIABLE` (any) | variables passed from host definition or command line |
Variables passed to `task_template`:

| variable | usage |
|---|---|
| `task` | task ID |
| `nodes` | number of nodes for a single task |
| `cores` | number of cores for a single task |
| `mem` | total RAM of a single task, passed to the template in bytes |
| `walltime` | total execution time of a task, passed to the template in `HH:MM:SS` format |
| `workdir` | current project directory (mapped to the remote host) |
| `command` | command to be executed by SoS (`sos execute ...`) |
| `VARIABLE` (any) | variables passed from host definition or command line |
Although not recommended, you can use the hostname or IP address of a remote host directly if you have set up public-key access to the host so that you do not have to enter a password to log in.
For example, with option `-r bcbm-bpeng.mdanderson.edu`, you can execute a shell script on the remote host `bcbm-bpeng.mdanderson.edu`, as sketched below.
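A minimal sketch, assuming SoS Notebook and a trivial workflow with a single `sh` step (the step content is illustrative; from the command line, the equivalent is `sos run script.sos -r bcbm-bpeng.mdanderson.edu`):

%run -r bcbm-bpeng.mdanderson.edu
[10]
sh:
  echo "Running on $(hostname) under $(pwd)"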
The workflow is executed under the "same" directory on bcbm-bpeng.mdanderson.edu; if you check the remote host, you will find a temporary `.sos` file under that directory. The example works, however, only because the remote host also has the directory `/Users/bpeng1/sos/sos-docs/src/user_guide`. The workflow would fail on, for example, a Linux system with only `/home/bpeng1`, in which case a host configuration is needed.
SoS hosts should be defined under the key `hosts` of an SoS configuration file, usually in `~/.sos/hosts.yml`. A basic host definition specifies an alias, an address and hostname, and lets SoS know the paths that could be matched with the paths of another host.
A simple definition for `bcbm-bpeng.mdanderson.edu` would be as sketched below.
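The exact paths depend on your setup; a minimal sketch, assuming the remote home directory is `/Users/bpeng1`, is

hosts:
  bcbm:
    address: bcbm-bpeng.mdanderson.edu
    paths:
      home: /Users/bpeng1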
With this definition, you can use `bcbm` as an alias for `bcbm-bpeng.mdanderson.edu` as follows.
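For instance, assuming a workflow file named `myscript.sos` (a placeholder for your own script), you could run

sos run myscript.sos -r bcbm

or execute the workflow cell in SoS Notebook with the `%run -r bcbm` magic.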
Again, the workflow works because the remote host has the same volumes as the local host. To properly set up a remote host, we need to define `paths` on the local and remote hosts, so that SoS knows how to map the current working directory.
The following configuration file defines two hosts, `mac_pro` and `bcbm`, and sets `localhost` to `mac_pro`. The `localhost` key tells SoS which host you are working on. It can be omitted if the local host can be identified by the hostname or address of one of the defined hosts, but here it has to be specified explicitly because `mac_pro` does not have a fixed hostname or IP address.
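A sketch of such a configuration, with home paths chosen to match the mapping described below (the `address: localhost` entry for `mac_pro` is an assumption):

localhost: mac_pro
hosts:
  mac_pro:
    address: localhost      # placeholder; mac_pro has no fixed public address
    paths:
      home: /Users/bpeng1
  bcbm:
    address: bcbm-bpeng.mdanderson.edu
    paths:
      home: /Users/bpeng1/scratch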
One or more paths can be defined under `paths` for each host, and SoS will try to map paths between local and remote hosts. For example, since we specify the home of `bcbm` to be `/Users/bpeng1/scratch`, a local directory `/Users/bpeng1/sos/sos-docs/src/user_guide` would be mapped to `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide` on `bcbm`.
Now, when we execute the workflow on the remote host, it is actually executed under `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide`. It is a common technique to use a dedicated directory on the remote host for SoS workflows to avoid overwriting useful files under `/Users/bpeng1`.
`sos` command on remote host
The command line tool `sos` needs to be installed on the remote host to execute workflows or tasks there. Since SoS uses a login shell to talk to remote hosts, it is generally good enough to have `sos` in `$PATH`. However, if you do not want to add the path to the `sos` executable to your `$PATH` on the remote host, you can define the full path to `sos` in the host definition.
For example
hosts:
  cluster:
    address: my_cluster
    sos: /share/app/python3.7/bin/sos
Option `paths` specifies common directories on different file systems. If local and remote hosts share certain file systems, you can list them under `shared` so that SoS will not attempt to copy files. For example, if two disk volumes are mounted on both `worker` and `server` under different directories, you can list them as
hosts:
  worker:
    shared:
      scratch: /mnt/scratch
      data: /mnt/data
  server:
    shared:
      scratch: /scratch/
      data: /shared/data
In this way, if you are working under a directory `/mnt/scratch/project`, SoS knows that the workflow would be available under `/scratch/project` on `server`, and will execute it directly there without copying the workflow from `worker` to `server`.
A basic setup for public-key authentication has a local private key (usually `~/.ssh/id_rsa`) and a public key that is listed in `~/.ssh/authorized_keys` on the remote host. An identity file essentially allows you to use an alternative file to store the private key, or different private keys for different remote hosts. A typical case in which a pem file is given is connecting to AWS EC2 instances, where pem files are generated by AWS for you to access the instances.
To allow the use of an identity file to connect to a remote host, you can define pem_file for the remote host. For example, you can have the following definition for an AWS EC2 instance
hosts:
  aws:
    address: ec2-user@xx.xx.xx.xx
    pem_file: /path/to/my.pem
    paths:
      home: /home/ec2-user/
If you have multiple EC2 instances sharing the same identity file, you can define the `pem_file` in the localhost definition as
localhost: desktop
hosts:
  desktop:
    pem_file: /path/to/my.pem
  aws1:
    address: ec2-user@xx.xx.xx.xx
  aws2:
    address: ec2-user@xx.xx.xx.xx
If you have different identity files for different remote hosts, you can specify them as a dictionary:
localhost: desktop
hosts:
  desktop:
    pem_file:
      aws1: /path/to/ec1.pem
      aws2: /path/to/ec2.pem
  aws1:
    address: ec2-user@xx.xx.xx.xx
  aws2:
    address: ec2-user@xx.xx.xx.xx
  non_aws:
    address: another_host
Note that SoS configurations can be split into multiple configuration files, so you can define hosts in `site_config.yml` or `~/.sos/hosts.yml`, and the locations of identity files in a separate configuration file.
Each host has a queue that specifies how tasks and workflows are executed. The default `queue_type` is `process`, meaning that the task or workflow is executed directly on the host, subject to `max_running_jobs`, which defaults to 10 for `process` queues.
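For example, a sketch of a `process`-type host that limits the number of concurrently running jobs (the alias, address, and path are placeholders):

hosts:
  office_server:
    address: user@server.url
    queue_type: process
    max_running_jobs: 4
    paths:
      home: /home/user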
For a `pbs`-type queue, which covers batch systems such as LSF, Slurm, and Torque, SoS creates a shell script from a `workflow_template` or `task_template`, submits it to the batch system, and monitors its progress.
The first property, `submit_cmd`, is the command that will be executed to submit the job. It accepts all variables for `task_template`, plus a variable `job_file` that points to the location of the job file on the remote host. The `submit_cmd` is usually as simple as
qsub {job_file}
but could contain other variables such as walltime
msub -l {walltime} < {job_file}
After the task is submitted, SoS tries to capture a `job_id` from the output of `submit_cmd`. The output differs from system to system, so `submit_cmd_output` could be as simple as
submit_cmd_output='{job_id}'
or something like
submit_cmd_output='Job <{job_id}> is submitted to queue <{queue}>'
Although currently unused, `status_cmd` and `kill_cmd` should be commands to query the status of or kill the PBS job with `job_id`. For example, for a basic Torque system, these properties could be
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
Variables passed to `task_template`:

| variable | usage |
|---|---|
| `task` | task ID |
| `nodes` | number of nodes for a single task |
| `cores` | number of cores for a single task |
| `mem` | total RAM of a single task, passed to the template in bytes |
| `walltime` | total execution time of a task, passed to the template in `HH:MM:SS` format |
| `workdir` | current project directory (mapped to the remote host) |
| `command` | command to be executed by SoS (`sos execute ...`) |
| `VARIABLE` (any) | variables passed from host definition or command line |
Variables `mem`, `walltime`, etc. are defined from task options
task: walltime='2h'
or from command line
%sos run -q cluster walltime=2h
to specify the resources needed for one task. The input values will be adjusted if multiple tasks are grouped together (with options `trunk_size` and `trunk_workers`). SoS recognizes the units of the input and converts them to standard formats (e.g. `HH:MM:SS` for `walltime`, bytes for `mem`) before passing them to the template.
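For example, a single task statement can combine several of these options (values are illustrative; `cores` and `nodes` map to the template variables of the same names):

task: walltime='2h', mem='4G', cores=4, nodes=1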
The SoS task executor treats `~/.sos/tasks/{task}.out` and `~/.sos/tasks/{task}.err` as the stdout and stderr of the batch system and depends on these files to report errors from it. It is therefore required to specify these two files as the standard output and error output of the cluster job.
A sample configuration for a cluster with a PBS/Torque system:

cluster:
  address: host.url
  description: cluster with PBS
  paths:
    home: /scratch/{user_name}
  queue_type: pbs
  status_check_interval: 30
  wait_for_task: false
  task_template: |
    #!/bin/bash
    #PBS -N {task}
    #PBS -l nodes={nodes}:ppn={ppn}
    #PBS -l walltime={walltime}
    #PBS -l mem={mem//10**9}GB
    #PBS -o /home/{user_name}/.sos/tasks/{task}.out
    #PBS -e /home/{user_name}/.sos/tasks/{task}.err
    #PBS -m ae
    #PBS -M email@address
    #PBS -v {workdir}
    {command}
  max_running_jobs: 100
  submit_cmd: qsub {job_file}
  status_cmd: qstat {job_id}
  kill_cmd: qdel {job_id}
A sample configuration for a cluster with MOAB:

cluster:
  address: host.url
  description: cluster with MOAB
  paths:
    home: /scratch/{user_name}
  queue_type: pbs
  status_check_interval: 30
  wait_for_task: false
  task_template: |
    #!/bin/bash
    #PBS -N {task}
    #PBS -l nodes={nodes}:ppn={ppn}
    #PBS -l walltime={walltime}
    #PBS -l mem={mem//10**9}GB
    #PBS -o /home/{user_name}/.sos/tasks/{task}.out
    #PBS -e /home/{user_name}/.sos/tasks/{task}.err
    #PBS -m ae
    #PBS -M email@address
    #PBS -v {workdir}
    {command}
  max_running_jobs: 100
  submit_cmd: msub {job_file}
  status_cmd: qstat {job_id}
  kill_cmd: qdel {job_id}
A sample configuration for a Slurm cluster:

slurm:
  description: cluster with SLURM
  address: host.url
  paths:
    home: /home/{user_name}
  queue_type: pbs
  status_check_interval: 120
  max_running_jobs: 15
  max_cores: 28
  max_walltime: "36:00:00"
  max_mem: 256G
  task_template: |
    #!/bin/bash
    #SBATCH --time={walltime}
    #SBATCH --partition=mstephens
    #SBATCH --account=pi-mstephens
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node={cores}
    #SBATCH --mem-per-cpu={mem_per_cpu}
    #SBATCH --job-name={task}
    #SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
    #SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
    cd {workdir}
    {command}
  walltime: "06:00:00"
  cores: 20
  mem_per_cpu: 1000
  submit_cmd: sbatch {job_file}
  submit_cmd_output: "Submitted batch job {job_id}"
  status_cmd: squeue --job {job_id}
  kill_cmd: scancel {job_id}
A sample configuration for an LSF cluster:

lsf:
  address: host.url
  description: cluster with LSF
  paths:
    home: /rsrch2/bcb/{user_name}
  queue_type: pbs
  status_check_interval: 30
  wait_for_task: false
  task_template: |
    #!/bin/bash
    #BSUB -J {task}
    #BSUB -q {'short' if int(walltime.split(':')[0]) < 24 else 'long'}
    #BSUB -n {cores}
    #BSUB -M {mem//10**9}G
    #BSUB -W 1:0
    #BSUB -o /home/{user_name}/.sos/tasks/{task}.out
    #BSUB -e /home/{user_name}/.sos/tasks/{task}.err
    #BSUB -N
    #BSUB -u email@address
    cd {workdir}
    {command}
  max_running_jobs: 100
  submit_cmd: bsub < {job_file}
  submit_cmd_output: 'Job <{job_id}> is submitted to queue <{queue}>'
  status_cmd: bjobs {job_id}
  kill_cmd: bkill {job_id}
Task Spooler is a light-weight job queue for single machines. A sample configuration:
taskspooler:
  description: task spooler on a single machine
  address: {user_name}@host.url
  port: 32771
  paths:
    home: /home/{user_name}
  queue_type: pbs
  status_check_interval: 5
  task_template: |
    #!/bin/bash
    cd {workdir}
    {command}
  max_running_jobs: 100
  submit_cmd: tsp -L {task} sh {job_file}
  status_cmd: tsp -s {job_id}
  kill_cmd: tsp -r {job_id}
`workflow_template` defines how to execute a workflow on the host.

| variable | usage |
|---|---|
| `filename` | filename of the script to be executed |
| `script` | content of the SoS workflow |
| `job_name` | a unique ID derived from the content of the workflow |
| `command` | command to be executed by SoS (`sos run ...`) |
| `VARIABLE` (any) | variables passed from host definition or command line |
A `workflow_template` can be very similar or even identical to a `task_template`. However, in contrast to `task_template`, where `walltime`, `mem`, etc. are converted and adjusted by SoS, these variables have to be hard-coded in the template or passed to `workflow_template` in string format, because variables for `workflow_template` can only be passed from the command line, such as
%run workflow -r host walltime=01:00:00
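As a sketch, assuming a Slurm scheduler, a `workflow_template` can embed the walltime string directly; the host alias, address, and default walltime are placeholders, and other required properties such as `paths` are omitted:

hosts:
  hpc:
    address: host.url
    queue_type: pbs
    walltime: "04:00:00"   # default; overridden by walltime=... on the command line
    workflow_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      #SBATCH --job-name={job_name}
      {command}
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"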
Template parameters can be used to increase the flexibility of templates. For example, you can specify the use of a certain version of R for the execution of workflows using the following template
hpc_server:
  address: ....
  paths: ...
hosts:
  hpc:
    based_on: hpc_server
    workflow_template: |
      module load R/{R_version}
      {command}
and execute your workflow as follows:
sos run script -r hpc R_version=3.4.4
This method works but it requires you to specify the version of R each time, which can be hard to remember. You could make it easier by setting a default version as follows:
hosts:
  hpc:
    based_on: hpc_server
    R_version: 3.4.4
    workflow_template: |
      module load R/{R_version}
      {command}
In this way, the variable `R_version` will be used for `workflow_template` by default, but will be overridden by an `R_version` specified on the command line.
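For example (the script name and version numbers are illustrative):

sos run script -r hpc                  # uses the default R_version 3.4.4
sos run script -r hpc R_version=4.0.0  # overrides the default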
If you would like to specify different versions in the template, you can define multiple hosts as follows:
hosts:
  hpc_r3.4.4:
    based_on: hpc_server
    R_version: 3.4.4
    workflow_template: |
      module load R/{R_version}
      {command}
  hpc_r3.6.0:
    based_on: hosts.hpc_r3.4.4
    R_version: 3.6.0
  hpc_sklearn:
    based_on: hosts.hpc_r3.6.0
    workflow_template: |
      module load R/{R_version}
      module load sklearn
      {command}
and use these environments with commands
sos run script -r hpc_r3.6.0
These templates make use of the facts that

- `based_on` copies the specified configuration entry
- new definitions override contents copied from `based_on` items
- templates are interpolated with variables defined in the same dictionary (e.g. `R_version`)