- Difficulty level: difficult
- Time needed to learn: 30 minutes or more
- Key points:
  - `~/.sos/hosts.yml` is essential for the use of many features of SoS
  - `task_template` is used for the submission of tasks to remote hosts
  - `workflow_template` is used for the execution of workflows on remote hosts
  - `process` (default) and `pbs` engines are supported
SoS uses configuration files to specify hosts and their features. Please make sure you understand the basic syntax of SoS configuration files before you continue.
Instead of using a dedicated server-side daemon process, SoS executes tasks or workflows on remote hosts through `ssh`. To use any remote host for the execution of SoS workflows or tasks:

- The local host needs to have `ssh`-related tools such as `ssh`, `scp`, and `rsync` installed.
- The remote hosts need to be accessible through public-key authentication, which can be set up manually or with command `sos remote setup`. The key can be the default (e.g. `~/.ssh/id_rsa`) or specified as an identity file.
- The remote hosts need to have `sos` installed and have it in their `$PATH`.
- Except for very simple cases, the remote hosts should be defined in SoS config files, preferably in `site_config.yml` by your system admin or `~/.sos/hosts.yml` by you, so that SoS knows how to work with them.
SoS copies tasks or workflows to remote hosts and executes them under specified remote directories. That is to say, if the local home directory is `/home/bpeng1`, the remote home directory is `/Users/bpeng1`, and a workflow is submitted from `/home/bpeng1/Projects`, it will be copied to and executed under directory `/Users/bpeng1/Projects` (if so configured) on the remote host. This implies that
- A complete host definition consists of definitions of a localhost and a remote host so that SoS knows how to map directories.
- SoS will create directories and files under specified directories on remote host, and it is your responsibility to remove them.
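To illustrate the mapping described above, here is a minimal sketch of a pair of host definitions; the aliases and addresses are hypothetical placeholders, not part of the original example:

```yaml
# Hypothetical hosts illustrating directory mapping; addresses are placeholders.
# A workflow submitted from /home/bpeng1/Projects on linux_box would be
# copied to and executed under /Users/bpeng1/Projects on mac_box.
hosts:
  linux_box:
    address: linux.example.com
    paths:
      home: /home/bpeng1
  mac_box:
    address: mac.example.com
    paths:
      home: /Users/bpeng1
```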
SoS uses templates to execute tasks or workflows on remote hosts, which can be batch systems such as LSF, Slurm, and Torque. When dealing with a batch system, SoS generates a shell script from a host-specific template with specified parameters, sends it to the remote host, and submits it. This is why host configuration is essential for the use of batch systems.
For your convenience, here is a summary of all possible properties of a host definition:

property | usage
---|---
`description` | description of the host
`address` | hostname or IP address; `username@host` is allowed
`port` | `ssh` port, if different from 22
`hostname` | optional hostname used to identify localhost
`pem_file` | identity file for one or more remote hosts
`paths` | paths that could be mapped between local and remote hosts
`shared` | shared file systems
`queue_type` | how tasks and workflows are managed; can be `process` (default), `pbs`, or `rq` (experimental)
`max_running_jobs` | max number of running jobs on the queue
`workflow_template` | template for the execution of workflows
`task_template` | template for the execution of tasks
`sos` | path to `sos` on the remote host if not in `$PATH`
`VARIABLE` (any) | variable used to interpolate other properties
Properties for PBS `queue_type`:

property | usage
---|---
`submit_cmd` | (PBS only) command to submit a job to the PBS system
`submit_cmd_output` | (PBS only) output of `submit_cmd`, used to extract `job_id`
`status_cmd` | (PBS only) command to query job status
`kill_cmd` | (PBS only) command to kill a PBS job
Variables passed to `workflow_template`:

variable | usage
---|---
`filename` | filename of the script to be executed
`script` | content of the SoS workflow
`job_name` | a unique ID derived from the content of the workflow
`command` | command to be executed by SoS (`sos run ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Variables passed to `task_template`:

variable | usage
---|---
`task` | task ID
`nodes` | number of nodes for a single task
`cores` | number of cores for a single task
`mem` | total RAM of a single task, passed to the template in bytes
`walltime` | total execution time of a task, passed to the template in the format `HH:MM:SS`
`workdir` | current project directory (mapped to remote host)
`command` | command to be executed by SoS (`sos execute ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Although not recommended, you can use the hostname or IP address of a remote host directly if you have set up public-key access to the host so that you do not have to enter a password to log in.

For example, with option `-r bcbm-bpeng.mdanderson.edu`, the following example executes a shell script on remote host `bcbm-bpeng.mdanderson.edu`.

The workflow is executed under the "same" directory on `bcbm-bpeng.mdanderson.edu`. If you actually check the remote host, you will find a temporary `.sos` file under that directory. However, the example works only because the remote host has directory `/Users/bpeng1/sos/sos-docs/src/user_guide`. The workflow would fail on, for example, a Linux system with only `/home/bpeng1`, in which case host configurations are needed.
SoS hosts should be defined under key `hosts` of an SoS configuration file, usually in `~/.sos/hosts.yml`. A basic host definition specifies an `alias`, an `address`, and a `hostname`, and lets SoS know the `paths` that could be matched with `paths` of another host.
A simple definition for `bcbm-bpeng.mdanderson.edu` would be
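The original configuration is not reproduced here; a minimal sketch, assuming only an alias and an address, would be:

```yaml
hosts:
  bcbm:
    address: bcbm-bpeng.mdanderson.edu
```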
With this definition, you can use bcbm
as an alias to bcbm-bpeng.mdanderson.edu
as follows
Again, the workflow works because the remote host has the same volumes as the local host. To properly set up a remote host, we need to define `paths` on the local and remote hosts so that SoS knows how to map the current working directory.
The following configuration file defines two hosts, one `mac_pro` and one `bcbm`, and sets `localhost` to `mac_pro`. The `localhost` key tells SoS which host you are working on. It can be omitted if the localhost can be identified by the `hostname` or `address` of a host, but I have to specify it explicitly here because `mac_pro` does not have a fixed hostname or IP address.
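A sketch of such a configuration, reconstructed from the description above (the address of `bcbm` and the exact local path value are assumptions):

```yaml
# Sketch: localhost set explicitly because mac_pro has no fixed address.
localhost: mac_pro
hosts:
  mac_pro:
    paths:
      home: /Users/bpeng1
  bcbm:
    address: bcbm-bpeng.mdanderson.edu   # assumed address
    paths:
      home: /Users/bpeng1/scratch
```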
One or more paths can be defined under `paths` for each host, and SoS will try to map paths between the local and remote hosts. For example, since we specify `home` of `bcbm` to be `/Users/bpeng1/scratch`, a local directory `/Users/bpeng1/sos/sos-docs/src/user_guide` would be mapped to `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide` on `bcbm`.

Now, when we execute the workflow on the remote host, it would actually be executed under `/Users/bpeng1/scratch/sos/sos-docs/src/user_guide`. It is a common technique to use a dedicated directory on the remote host for SoS workflows to avoid overwriting useful files under `/Users/bpeng1`.
### `sos` command on remote host

The command-line tool `sos` needs to be installed on the remote host to execute workflows or tasks there.
Since SoS uses a login shell to talk to remote hosts, it is generally good enough to have `sos` in `$PATH`. However, if you do not want to add the path to the `sos` executable to your `$PATH` on the remote host, you can define the full path to `sos` in the host definition.
For example:

hosts:
  cluster:
    address: my_cluster
    sos: /share/app/python3.7/bin/sos
Option `paths` specifies common directories on different file systems. If the local and remote hosts share certain file systems, you can list them under `shared` so that SoS will not attempt to copy files. For example, if two disk volumes are mounted on both `worker` and `server` under different directories, you can list them as
hosts:
worker:
shared:
scratch: /mnt/scratch
data: /mnt/data
server:
shared:
scratch: /scratch/
data: /shared/data
In this way, if you are working under a directory `/mnt/scratch/project`, SoS knows that the workflow will be available under `/scratch/project` on `server` and executes it directly there without copying the workflow from `worker` to `server`.
A basic setup for public-key authentication has a local private key (usually `~/.ssh/id_rsa`) and a public key that is listed in `~/.ssh/authorized_keys` on the remote host. An identity file essentially allows you to use an alternative file to store the private key, or different private keys for different remote hosts. A typical case in which a `pem` file is given is connecting to AWS EC2 instances, where `pem` files are generated by AWS for you to access the instances.

To allow the use of an identity file to connect to a remote host, you can define `pem_file` for the remote host. For example, you can have the following definition for an AWS EC2 instance
hosts:
aws:
address: ec2-user@xx.xx.xx.xx
pem_file: /path/to/my.pem
paths:
home: /home/ec2-user/
If you have multiple EC2 instances sharing the same identity file, you can define the `pem_file` in the definition of the localhost, as
localhost: desktop
hosts:
desktop:
pem_file: /path/to/my.pem
aws1:
address: ec2-user@xx.xx.xx.xx
aws2:
address: ec2-user@xx.xx.xx.xx
If you have different identity files for different remote hosts, you can specify them as a dictionary:
localhost: desktop
hosts:
desktop:
pem_file:
aws1: /path/to/ec1.pem
aws2: /path/to/ec2.pem
aws1:
address: ec2-user@xx.xx.xx.xx
aws2:
address: ec2-user@xx.xx.xx.xx
non_aws:
address: another_host
Note that SoS configurations can be split into multiple configuration files, so you can define hosts in `site_config.yml` or `~/.sos/hosts.yml`, and the locations of identity files in a separate configuration file.
Each host has a queue that specifies how tasks and workflows are executed. The default `queue_type` is `process`, meaning that the task or workflow will be executed directly, subject to `max_running_jobs`, which defaults to 10 for the `process` queue.
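As a sketch, a host using the default `process` queue needs no template at all; the alias and address below are placeholders:

```yaml
hosts:
  server:
    address: server.example.com
    queue_type: process    # default; may be omitted
    max_running_jobs: 5    # override the default of 10
```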
For a PBS-style batch system such as LSF or Slurm, SoS creates a shell script from a `workflow_template` or `task_template`, submits it to the batch system, and monitors its status.
The first property, `submit_cmd`, is the command that will be executed to submit the job. It accepts all variables for `task_template`, plus a variable `job_file` that points to the location of the job file on the remote host. The `submit_cmd` is usually as simple as

qsub {job_file}

but could contain other variables such as `walltime`

msub -l {walltime} < {job_file}
After the task is submitted, SoS tries to capture a `job_id` from the output of `submit_cmd`. The output differs from system to system, so `submit_cmd_output` could be as simple as
submit_cmd_output='{job_id}'
or something like
submit_cmd_output='Job <{job_id}> is submitted to queue <{queue}>'
Although currently unused, `status_cmd` and `kill_cmd` should be commands to query the status of, or kill, the PBS job with `job_id`. For example, for a basic Torque system, these properties could be
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
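Putting these PBS-related properties together, a minimal Torque-style entry could look like the following sketch (the alias and address are placeholders):

```yaml
hosts:
  torque:
    address: host.url
    queue_type: pbs
    submit_cmd: qsub {job_file}       # submits the generated job file
    submit_cmd_output: '{job_id}'     # qsub prints the job id directly
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}
```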
Variables passed to `task_template`:

variable | usage
---|---
`task` | task ID
`nodes` | number of nodes for a single task
`cores` | number of cores for a single task
`mem` | total RAM of a single task, passed to the template in bytes
`walltime` | total execution time of a task, passed to the template in the format `HH:MM:SS`
`workdir` | current project directory (mapped to remote host)
`command` | command to be executed by SoS (`sos execute ...`)
`VARIABLE` (any) | variables passed from host definition or command line
Variables `mem`, `walltime`, etc. are defined from task options

task: walltime='2h'

or from the command line

%sos run -q cluster walltime=2h

to specify the resources needed for one task. The input values will be adjusted if multiple tasks are grouped together (with options `trunk_size` and `trunk_workers`). SoS recognizes the units of the input and converts it to the standard `HH:MM:SS` format before passing it to the template.
The SoS task executor treats `~/.sos/tasks/{task}.out` and `~/.sos/tasks/{task}.err` as the `stdout` and `stderr` of the PBS system and depends on these files to report errors from the PBS system. It is therefore required to specify these two files as the standard output and error output of the cluster job.
cluster:
address: host.url
description: cluster with PBS
paths:
home: /scratch/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#PBS -N {task}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
#PBS -l mem={mem//10**9}GB
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
#PBS -e /home/{user_name}/.sos/tasks/{task}.err
#PBS -m ae
#PBS -M email@address
#PBS -v {workdir}
{command}
max_running_jobs: 100
submit_cmd: qsub {job_file}
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
cluster:
address: host.url
description: cluster with MOAB
paths:
home: /scratch/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#PBS -N {task}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
#PBS -l mem={mem//10**9}GB
#PBS -o /home/{user_name}/.sos/tasks/{task}.out
#PBS -e /home/{user_name}/.sos/tasks/{task}.err
#PBS -m ae
#PBS -M email@address
#PBS -v {workdir}
{command}
max_running_jobs: 100
submit_cmd: msub {job_file}
status_cmd: qstat {job_id}
kill_cmd: qdel {job_id}
slurm:
description: cluster with SLURM
address: host.url
paths:
home: /home/{user_name}
queue_type: pbs
status_check_interval: 120
max_running_jobs: 15
max_cores: 28
max_walltime: "36:00:00"
max_mem: 256G
job_template: |
#!/bin/bash
#SBATCH --time={walltime}
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node={cores}
#SBATCH --mem-per-cpu={mem_per_cpu}
#SBATCH --job-name={task}
#SBATCH --output=/home/{user_name}/.sos/tasks/{task}.out
#SBATCH --error=/home/{user_name}/.sos/tasks/{task}.err
cd {workdir}
{command}
walltime: "06:00:00"
cores: 20
mem_per_cpu: 1000
submit_cmd: sbatch {job_file}
submit_cmd_output: "Submitted batch job {job_id}"
status_cmd: squeue --job {job_id}
kill_cmd: scancel {job_id}
lsf:
address: host.url
description: cluster with LSF
paths:
home: /rsrch2/bcb/{user_name}
queue_type: pbs
status_check_interval: 30
wait_for_task: false
job_template: |
#!/bin/bash
#BSUB -J {task}
#BSUB -q {'short' if int(walltime.split(':')[0]) < 24 else 'long'}
#BSUB -n {cores}
#BSUB -M {mem//10**9}G
#BSUB -W 1:0
#BSUB -o /home/{user_name}/.sos/tasks/{task}.out
#BSUB -e /home/{user_name}/.sos/tasks/{task}.err
#BSUB -N
#BSUB -u email@address
cd {workdir}
{command}
max_running_jobs: 100
submit_cmd: bsub < {job_file}
submit_cmd_output: 'Job <{job_id}> is submitted to queue <{queue}>'
status_cmd: bjobs {job_id}
kill_cmd: bkill {job_id}
Task Spooler is a light-weight task spooler for single machines.
taskspooler:
description: task spooler on a single machine
address: {user_name}@host.url
port: 32771
paths:
home: /home/{user_name}
queue_type: pbs
status_check_interval: 5
task_template: |
#!/bin/bash
cd {workdir}
{command}
max_running_jobs: 100
submit_cmd: tsp -L {task} sh {job_file}
status_cmd: tsp -s {job_id}
kill_cmd: tsp -r {job_id}
`workflow_template` defines how to execute a workflow on the host.
variable | usage
---|---
`filename` | filename of the script to be executed
`script` | content of the SoS workflow
`job_name` | a unique ID derived from the content of the workflow
`command` | command to be executed by SoS (`sos run ...`)
`VARIABLE` (any) | variables passed from host definition or command line
A `workflow_template` can be very similar or even identical to a `task_template`. However, in contrast to `task_template`, where `walltime`, `mem`, etc. are converted and adjusted by SoS, these variables have to be fixed in the template or passed in string format to `workflow_template`, because variables for `workflow_template` can only be passed from the command line, such as
%run workflow -r host walltime=01:00:00
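For example, here is a sketch of a host whose `workflow_template` uses `walltime` verbatim (the alias and address are placeholders); since SoS does not convert variables passed to `workflow_template`, the value supplied on the command line must already be in the format the scheduler expects:

```yaml
hosts:
  hpc:
    address: host.url
    workflow_template: |
      #!/bin/bash
      #SBATCH --time={walltime}
      {command}
```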
Template parameters can be used to increase the flexibility of templates. For example, you can specify the use of a certain version of R for the execution of workflows using the following template
hpc_server:
address: ....
paths: ...
hosts:
hpc:
based_on: hpc_server
R_version: 3.3.1
workflow_template: |
module load R/{R_version}
{command}
and execute your workflow as follows:
sos run script -r hpc R_version=3.4.4
This method works, but it requires you to specify the version of R each time, which can be hard to remember. You can make it easier by setting a default version as follows:
hosts:
hpc:
based_on: hpc_server
R_version: 3.4.4
workflow_template: |
module load R/{R_version}
{command}
In this way, variable `R_version` will be used for `workflow_template`, but will be overridden by an `R_version` specified on the command line.
If you would like to specify different versions in the template, you can define multiple hosts as follows:
hosts:
hpc_r3.4.4:
based_on: hpc_server
R_version: 3.4.4
workflow_template: |
module load R/{R_version}
{command}
hpc_r3.6.0:
based_on: hosts.hpc_r3.4.4
R_version: 3.6.0
hpc_sklearn:
based_on: hosts.hpc_r3.6.0
workflow_template: |
module load R/{R_version}
module load sklearn
{command}
and use these environments with commands
sos run script -r hpc_r3.6.0
These templates make use of the facts that

- `based_on` copies the specified entry
- New definitions override contents from `based_on` items
- Templates are expanded using variables defined in the same dictionary (e.g. `R_version`)