- Difficulty level: intermediate
- Time need to lean: 10 minutes or less
- Key points:
- Option
trunk_sizegroups small tasks into larger ones - Option
trunk_workersdetermines number of workers per master task - Tasks can be dispatched and executed on multiple nodes on a cluster system
- Option
From time to time you may face the problem with many small tasks, such as running millions of simulations or analyzing thousands of genes. Whereas each simulation or analysis takes just a few minutes to complete, the entire analysis will take a long time and needs to be performed on a cluster. However, most cluster systems does not welcome millions or small tasks as managing a large number of jobs can pose management challenges to the scheduler.
What users have usually done are running these analysis in batch, which works more or less like the following script if implemented in SoS:
The args option here determines what will be passed to the underlying bash command, which should contain {filename} as the filename of the temporary file generated by SoS. In this particular example we use
f'{{filename}} {batch*4+1} {(batch+1)*4}'
so that the following bash commands will be executed
bash {filename} 1 4
bash {filename} 5 8
bash {filename} 9 12
bash {filename} 13 16
for substeps with batch equals to 0, 1, 2 and 3 respectively.
Now that we have fewer number of jobs, we can submit the shell scripts to a batch system as tasks
The tasks in this example are executed locally but you can send the tasks to a remote host using
task: queue='host'
or
%run -q host
The trunk_size task option
The trunk_size=n option groups tasks into groups of size `n` before submitting them to an executor. As a special case, if option `trunk_size` is specified but with a value `None`, all tasks from the step will be grouped together.
The aforementioned example can be implemented in a much easier way as follows using the trunk_size task option:
In this example, 15 tasks are generated from 15 substeps, each running a bash script
echo "Processing {id}"
with id = 0, ..., 15 respectively.
With option trunk_size=4, the tasks are grouped into master tasks with names starting with M5_.
The trunk_workers task option
The trunk_workers=n option specify the number of concurrent workers in each task. Similar to option -j for commands sos run and sos execute, it accepts the specification of multiple worker processes on multiple nodes. The value of this parameter will affects variables such as nodes, cores, walltime and mem in task templates.
The master tasks by default execute subtasks sequentially. If the master task has a large number of subtasks and there are computing resources available, you can specifying another option trunk_workers to set the number of workers for each master task. For example, in the following SoS workflow, the 16 tasks are submitted as a single (master) task and will be processed by two workers.
When you submit a master task to the cluster system, you typically need to specify the resources needed for tasks, using options walltime, mem and cores. These variables will be translated and expanded in a task template (defined in SoS configuration files).
For example, with the following template for a SLURM-based cluster system,
#!/bin/bash
#SBATCH --time={walltime}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node={cores}
#SBATCH --mem-per-cpu={mem // cores // 1000000000}G
#SBATCH --job-name={task}
#SBATCH --output=/home/{user_name}/.sos/{task}.out
#SBATCH --error=/home/{user_name}/.sos/{task}.err
sos execute {task} -v {verbosity} -s {sig_mode} -r {run_mode}
A task with options mem='1G' and walltime='12h' will populate the template with variables mem=1000000000 (1G) and walltime=01:00:00`.
If you group multiple tasks using options trunk_size and trunk_worker, the template variables will be adjusted automatically. For example,
input: for_each=dict(id=range(16))
task: trunk_size=None, trunk_workers=2, mem='1G', walltime='1h'
bash: expand=True
echo "Processing {id+1}"
will generate mem='2000000000' (2G) and walltime=08:00:00' because two concurrent workers will use double the RAM, and each worker will process 8 tasks sequentially, in a total of 8 hours.
If you have many small tasks and would like to submit them as a small number or a single job on the cluster system, and would like to make use of multiple nodes to process them, you can specify more than one computing nodes with option trunk_workers. In this case, trunk_workers should be a list with its length indicating the number of nodes and its elements indicating the number of workers on each node.
For example, with the template above, running sos run test.sos -q slurm_cluster will create a single master task with 16 subtasks, processed by 8 workers on two computing nodes.
input: for_each=dict(id=range(16))
task: trunk_size=None, trunk_workers=[4, 4], mem='1G', walltime='1h'
bash: expand=True
echo "Processing {id+1}"
Under the hood, sos will
- determine number of nodes and set variable
nodesto2 - adjust
memandwalltimeaccording to number of concurrent and sequential tasks. - populate the task template with variables
nodes,mem,walltimeetc - submit the generated shell script to create a multi-node job
- Because the task is executed on a cluster system, it will automatically recognize the nodes and processes per node, create workers accordingly to process the tasks on multiple nodes.