
Customized or remote execution of workflows

  • Difficulty level: intermediate
  • Time needed to learn: 30 minutes or less
  • Key points:
    • Option -r host executes the workflow on host, optionally through a workflow_template specified in the host configuration.
    • The remote host could be a regular server, or a cluster system, in which case the workflow could be executed using multiple computing nodes.
    • Local and remote hosts should share directories (e.g. via NFS mount) for input and output files.

Option -r host executes the workflow on host. Depending on the properties of host, this option allows you to

  1. Execute workflows locally, but in a customized environment
  2. Execute workflows on a remote server directly
  3. Execute entire workflows on a remote cluster system with option -r
  4. Execute entire workflows on a remote cluster system, submitting additional tasks from the computing nodes, with options -r and -q

Please refer to host configuration for details on how hosts are defined.

Customized local environment for workflow execution

Assume a system with two R versions: a system R installation under /usr/local/bin and a local installation in a conda environment. The conda version is the default, but if for some reason the system R is preferred (e.g. a library is only available there), you can change the PATH of an individual R action using its env option (see SoS actions for details).

In [1]:
[1] "R version 3.6.1 (2019-07-05)"
In [2]:
[1] "R version 3.5.2 (2018-12-20)"
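
An action-level override of this kind might look like the following sketch, assuming the system R lives under /usr/local/bin (the exact paths are illustrative):

# run R with a PATH that puts the system R under /usr/local/bin first,
# so the conda installation elsewhere on the system is not picked up
R: env={'PATH': '/usr/local/bin:/usr/bin:/bin'}
    R.version.string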

This action-level env configuration is very flexible (e.g. you can use different versions of R in the same workflow) but can be difficult to maintain if you have multiple R actions. If you intend to use the same version of R throughout the workflow, it is easier to execute the entire workflow in a customized environment.

To achieve this, you can define a host as follows, which has a default address of localhost.

In [3]:
Cell content saved to myconfig.yml, use option -r to also execute the cell.
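
A system_R host of this kind might be defined along the following lines (the template content is illustrative):

hosts:
  system_R:
    address: localhost
    workflow_template: |
      #!/bin/bash
      # put the system R under /usr/local/bin in front of the conda installation
      export PATH=/usr/local/bin:$PATH
      {command}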

Then, we will be using the conda version of R by default

In [4]:
[1] "R version 3.6.1 (2019-07-05)"

and will be using the system R if we execute the workflow on the system_R host with option -r, even though system_R is just the localhost with a workflow_template

In [5]:
[1] "R version 3.5.2 (2018-12-20)"

As you can imagine, the template can set up a variety of environments, such as conda environments, debug environments, or environments configured with module load on a cluster system.
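
For instance, a host whose workflow_template activates a particular conda environment (the host and environment names here are only examples) could be defined as:

hosts:
  conda_debug:
    address: localhost
    workflow_template: |
      #!/bin/bash
      # run the entire workflow inside a dedicated conda environment
      source activate debug-env
      {command}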

Execution of workflow on a remote host

If the host is a real remote host, then

sos run script workflow -r host [other options]

would execute the entire workflow on the host.

This option is useful if you would like to write the entire workflow for a remote host and execute the workflow with all input, software, and output files on the remote host. Typical use cases for this option are when the data is too large to be processed locally, or when the software is only available on the remote host.

For example, with a host definition similar to

hosts:
  bcb:
    address:  myserver.utexas.edu
    paths:
      home: /Users/bpeng1/scratch

the following cell executes the workflow on bcb
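
The executed cell is an SoS workflow run with the %run magic and option -r, along the following lines (the R step that creates test.png is illustrative):

%run -r bcb
[default]
R:
    png('test.png')
    plot(rnorm(100))
    dev.off()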

In [6]:
INFO: No matching tasks are identified. Use option -a to check all tasks.
INFO: Running default: 
null device 
          1 
INFO: default is completed.
INFO: Workflow default (ID=94ec1bfb48adfe06) is executed successfully with 1 completed step.

The resulting test.png is generated on bcb and is unavailable for local preview. You can, however, preview the file with the -r option of magic %preview.
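
Such a preview cell might look like the following (using the bcb host defined above):

%preview test.png -r bcb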

In [7]:
> test.png (30.5 KiB):

In this case we do not define any workflow_template for bcb, so the workflow is executed directly on bcb. If a workflow_template is defined, the workflow will be executed through the shell script that is expanded from the template.

Note that local configurations, including the ones specified with option -c, will be transferred to and used on the remote host, with only the localhost definition switched to the remote host. It is therefore safe to use local configurations with option -r.

Executing entire workflows on remote cluster systems

If the remote host specified by option -r host is a cluster system, the workflow will be submitted to the cluster as a regular cluster job. The workflow_template of host will be used, expanded with variables specified on the command line (-r host KEY=VALUE KEY=VALUE ...).

If nodes=1 is specified, the workflow will be executed on a single node, similar to what happens when you execute the command locally. If nodes=n is specified, the workflow will be executed in a multi-node mode with -j automatically expanded to include remote workers assigned to the cluster job. See option -j for details.
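
For example, with a cluster host such as the htc host defined at the end of this tutorial, a multi-node execution could be requested with a command similar to the following (script name and resource values are illustrative):

sos run myscript.sos -r htc nodes=4 walltime=02:00:00 mem=8G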

Because no -q option is specified, all tasks will be executed as regular substeps (as long as no queue= option is specified) and dispatched to the computing nodes for execution. This method is most efficient for the execution of a large number of small tasks because there is no overhead for the creation and execution of external tasks. It is less efficient for workflows with tasks of varying "sizes" because all computing nodes may have to wait for one computing node to finish a single substep, in which case the following execution method, namely submitting additional tasks from computing nodes, could be better.

Executing workflows on cluster systems with additional task submission

We already discussed the use of option -q to submit tasks to remote hosts or cluster systems. The key idea is that the workflow is executed locally: the main sos process monitors the remote tasks and continues after they complete. Because tasks are executed separately from the workflow process, the workflow can be safely killed and resumed, which is one of the main advantages of the SoS task execution model.

It becomes a bit tricky to submit the workflow to a cluster system while allowing the master sos process, now executed on a computing node, to submit additional jobs. The command can be as simple as

sos run script -r host -q host

but things can go wrong if you do not understand completely what is happening:

  1. With option -r, the entire workflow is executed on host. Since host has a PBS queue, the workflow will be submitted as a regular cluster job. The resources needed for the master job need to be specified on the command line and expanded in workflow_template.
  2. The master sos process will be executed on a computing node with the transferred local configuration and option -q host. Because localhost is set to host by option -r, the cluster will appear to sos as the localhost, albeit with a different IP address etc.
  3. The sos process will try to submit the tasks to the headnode through commands similar to ssh host qsub job_id. It is therefore mandatory that host can be accessed with the specified address from the computing nodes. This is usually not a problem, but it is possible that the headnode has different external and internal addresses, in which case you will have to define a different host for option -q.

For example, a cluster host with both a workflow_template (used by option -r) and a task_template (used by option -q) can be defined as follows:
hosts:
  htc:
    address: htc_cluster.mdanderson.edu
    description: HTC cluster (PBS)
    queue_type: pbs
    status_check_interval: 60
    submit_cmd: qsub {job_file}
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}
    nodes: 2
    cores: 4
    walltime: 01:00:00
    mem: 4G
    workflow_template: |
      #!/bin/bash
      #PBS -N {job_name}
      #PBS -l nodes={nodes}:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l mem={mem}
      #PBS -m n
      module load R
      {command}
    task_template: |
      #!/bin/bash
      #PBS -N {job_name}
      #PBS -l nodes={nodes}:ppn={cores}
      #PBS -l walltime={walltime}
      #PBS -l mem={mem//10**9}GB
      #PBS -o ~/.sos/tasks/{task}.out
      #PBS -e ~/.sos/tasks/{task}.err
      #PBS -m n
      module load R
      {command}
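
With such a definition, the entire workflow can be submitted as a cluster job that in turn submits its own tasks from the computing node, for example with a command similar to the following (script name and resource values are illustrative; the KEY=VALUE pairs follow -r host as described above):

sos run myscript.sos -r htc nodes=1 walltime=24:00:00 mem=4G -q htc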