March 29, 2018

How does SoS compare with other workflow engines

Over 200 workflow systems have been developed to date. Like other software tools, many of them are actively evolving, with new features added from time to time. This post compares SoS with some of the most popular workflow systems similar to it, to illustrate the features and limitations of SoS as a conventional workflow system. It should be read as a checklist of basic workflow features, separate from the unique niche that SoS occupies among workflow systems, which is explained in the next section and in other posts.

How does SoS compare with Nextflow, Snakemake, Bpipe, CWL, and Galaxy

Most workflow systems are designed for “consumers” of workflows, with an emphasis on efficient execution of well-crafted workflows whose details are hidden. In comparison, SoS is designed for “developers” of workflows for ad hoc data processing, with an emphasis on lowering the barrier to using workflows in daily computational research. The following tables compare basic features, workflow features, and built-in support for external tools and services among SoS, Nextflow, Snakemake, Bpipe, CWL, and Galaxy.

Basic information

Workflow | SoS | Nextflow | Snakemake | Bpipe | CWL | Galaxy
Language | Python based | Groovy flavored | GNU Make style, Python flavored | Groovy flavored | N/A | N/A
The scripting language for workflow specification. | SoS extends Python 3.6 with a number of SoS-specific syntax extensions and pre-defined functions. | Nextflow is based on Groovy syntax with Nextflow-defined functions and objects. | Snakemake is written in Python and has the flavor of the Make system in syntax and execution. | Bpipe is implemented in Groovy. Its syntax departs as little as possible from the simplicity of the shell script. | CWL workflows are specified in JSON or YAML format. | Galaxy's workflows are stored in JSON files together with GUI-related meta information.
User interface | CLI + Notebook (Jupyter) | CLI | CLI | CLI | CLI (cwltool) | CLI + GUI
Primary methods for users to interact with the workflow engine. | SoS provides two user interfaces: the command line (sos command) and Jupyter magics (%run, %sorun, etc.). | Nextflow workflows are executed with a nextflow command. | Snakemake workflows are executed with a snakemake command. | Bpipe workflows are executed with a bpipe command. | cwltool has a CLI, but other workflow engines could provide a GUI. | Galaxy workflows are mostly executed using a web interface, but they can also be executed using a CLI.
File format | .sos (plain text) and Jupyter notebook | .nf (plain text) | Snakefile (plain text) | .pipe (groovy, plain text) | .cwl and .yml (JSON/YAML) | XML
Format(s) to save workflows. | SoS workflows can be saved in a plain text .sos format, or be embedded in a Jupyter Notebook with the SoS kernel. | Plain text file with .nf extension. | Plain text file named Snakefile, or with *.rules extension for rules from another file. | Plain text file with .pipe or *.groovy extensions. | CWL documents are written in JSON or YAML, or a mix of the two. | Galaxy files are saved by the framework and are not supposed to be edited directly.
IDE | SoS Notebook (Jupyter) | No | No | No | No (cwltool) | Yes (only for building DAG)
Integrated Development Environment. | SoS uses SoS Notebook, a companion polyglot notebook environment based on Jupyter, as its IDE. | No dedicated IDE is available, but users can use IDEs that support Groovy (e.g. Eclipse, NetBeans) to edit (but not execute) Nextflow workflows. | No dedicated IDE is available, but syntax highlighting plugins are provided for some text editors. | No dedicated IDE is available, but editors supporting Groovy syntax can be used to facilitate pipeline development. | No IDE is provided for cwltool, but other task engines might provide one. | A web interface is provided to create steps and connect them.

Workflow features

Workflow | SoS | Nextflow | Snakemake | Bpipe | CWL | Galaxy
DAG Building | Explicit DAG of steps by connecting steps, implicit by target matching | Implicit DAG of steps from input/output | Implicit DAG by files from pattern matching input/output | Implicit DAG of steps from input/output | Implicit DAG of steps from input/output | Explicit DAG of steps by connecting steps
Methods and logic to construct dependency graphs connecting tasks in a workflow. | SoS supports explicit forward-style (sequentially numbered steps), makefile-style (dependency), and mixed-style workflows, and steps can explicitly depend on other steps. | Nextflow specifies processes with input and output, and creates a DAG of steps (the processes). | Relies on filename (pattern) matching to determine execution sequence. | Bpipe specifies stages with input and output, and creates a DAG from the stages. | The DAG is constructed from the source of steps. | The DAG of Galaxy is built explicitly using its web interface.
Streaming processing | No | Yes | Yes | No | Optional | No
Ability to process task inputs/outputs as a stream of data. | SoS "data" are passed around as files. | Processes in Nextflow can communicate via asynchronous FIFO queues, called channels in the Nextflow lingo. | From Snakemake 5.0 on, it is possible to mark output files as pipes. | Input and output variables are files. | The input files can be "streamable" and may be handled by pipes. | Galaxy does not support streaming between steps.
Subworkflow | Yes | Yes | Yes | Yes | Yes | Yes
Support for executing subworkflows, potentially loaded from another pipeline file. | SoS provides a sos_run(name) function to dynamically execute a subworkflow. | Nextflow supports subworkflows through the use of submodules. | Rules can be loaded from other text files. Subworkflows can be achieved by setting the input of one workflow explicitly as the output of another workflow. | The Bpipe run keyword uses the + operator to connect selected stages to a pipeline. The Load statement can be used to import variables and pipeline stages from other files. | A CWL workflow can be used in place of a regular CWL step. | Subworkflows are supported, although they cannot be generated dynamically as in other workflow tools.
Atomic Write | Yes | Yes | Yes | Yes | Implementation dependent | Likely yes
Generate output only when the step completes so that failed steps do not leave incomplete output files. | SoS uses step signatures to track the output of steps and will remove partial output when the step fails. | Nextflow steps are executed in a staging area so all outputs are complete. | Snakemake uses .snakemake/incomplete_files to track partial output files from failed runs. | Output from failed steps is cleaned up so failed steps will not get in the way during re-execution. | The CWL specification does not require atomic write, but individual workflow engines will likely implement it in some way. | We could not find any information related to how Galaxy recovers from failed steps. It is likely that its steps are staged so writes are atomic.
Named input/output | Yes | Yes | Yes | No | Yes | Yes
Label step input and output and use the labels to connect steps as a flow of data. | SoS supports named input and output through keyword arguments in input and output statements and refers to them with the function named_output. | The "from" part of input essentially names the input. | Snakemake supports named input through keyword arguments in input and output statements. | There seems to be no way to group input by names in Bpipe. | CWL supports named output and the creation of data flow. | Galaxy workflows explicitly label inputs and outputs.
Modify and resume | Yes | Yes | Yes | Yes | Yes (Optional for other engines) | No (?)
Able to resume an interrupted or modified workflow and ignore parts of the workflow that have been successfully executed. | SoS automatically keeps signatures of steps and tasks and can ignore steps and tasks that have already been executed, even if they were executed by a different workflow. | Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are actually changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result used instead. | Similar to Make, Snakemake uses timestamps to determine modification status and resume points. | Uses a customized timestamp signature (at millisecond resolution) of input/output to determine modification. By default it does not check the status of command or script changes. | Pausing and resuming a workflow is not part of the specification and is not required. | No information on runtime signatures or restart of failed jobs could be found.
Built-in remote execution | Yes | No | No | No | No | No
Send tasks to remote hosts for execution. | SoS can execute entire workflows or individual tasks on multiple remote hosts, with file synchronization between heterogeneous file systems. | Nextflow can be executed in a variety of environments but it has to be started within the environments. | Snakemake can be executed in a variety of environments but it has to be started within the environments. | Bpipe can be executed in a variety of environments but it has to be started within the environments. | The CWL specification does not contain any feature for remote execution. | Galaxy can be executed in a variety of environments but it has to be started within the environments.
Task monitoring | Command line and GUI (Notebook), with summary report | Report traces and performance | Report traces and performance | Event notification | Implementation dependent | GUI to explore, share and reuse histories
Ability to send tasks to multiple isolated computing environments and manage them from the local host. "Report traces and performance" means that benchmarking commands and outputs are logged, along with resource usage such as CPU hours and memory consumption. | SoS can monitor tasks through the Jupyter Notebook interface with magics (e.g. %taskinfo) to retrieve details about the tasks. It can also monitor the status of tasks through a command line interface (e.g. sos status). A summary report could be generated with option -p. | Nextflow can generate complete reports with details on CPU/task usage etc. | Benchmarking and logging. | Notification in Bpipe can be configured through Gmail, or generic SMTP / XMPP protocols. It also provides commands such as send, succeed, fail for arbitrary notifications. | There is no mention of job monitoring in the CWL specification, but workflow engines should provide their own facilities for job monitoring. | The GUI shows the status of each step with colors.
Process-oriented workflow | Yes | Yes | No | Yes | Yes | Yes
Workflows that are constructed and executed by steps to execute. | SoS' "forward-style" workflow specifies steps of workflows through sequential numbering, although a DAG could be constructed with target dependencies. | Nextflow executes the specified workflow with specified input and parameters. | A Snakemake workflow depends on filename wildcard pattern matching, not rule names, although rule order and rule priorities can be configured to change execution ordering. | A Bpipe workflow is process-oriented and executed in parallel. | CWL executes workflows from specified steps and inputs, not from desired output. | Galaxy constructs and executes workflows as connected steps.
Output-oriented workflow | Yes | No | Yes | No | No | No
Workflows that are constructed and executed by the "outcome" of the workflow. | SoS' auxiliary steps specify outcomes of steps and will be called when the target is needed. | Nextflow executes the specified workflow with specified input and parameters. | Snakemake workflows are output-oriented: execution ordering relies on filename patterns (with exceptions). | Bpipe does not use implicit file name pattern matching to construct pipelines, although it supports input file wildcards for running multiple stages simultaneously on different data. | CWL does not use implicit workflow construction to execute a workflow to generate specified outcomes. | Galaxy does not automatically build workflows from intended outcomes.
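
The difference between the last two rows of this table can be made concrete with a small sketch. The Python below is a toy illustration only and is not how SoS, Snakemake, or any other engine listed here is implemented; the step names, file names, and both function names are hypothetical. A process-oriented run executes the steps the author listed, in order, while an output-oriented run starts from a requested target file and works backwards through the declared inputs and outputs to find the steps that are actually needed.

```python
# Hypothetical steps: each declares the files it consumes and produces.
STEPS = {
    "align":  {"inputs": ["reads.fastq"], "outputs": ["aligned.bam"]},
    "sort":   {"inputs": ["aligned.bam"], "outputs": ["sorted.bam"]},
    "call":   {"inputs": ["sorted.bam"],  "outputs": ["variants.vcf"]},
    "report": {"inputs": ["sorted.bam"],  "outputs": ["qc.html"]},
}


def process_oriented(step_names):
    """Process-oriented: run the named steps in the order they were given."""
    return list(step_names)


def output_oriented(target, existing_files, steps=STEPS):
    """Output-oriented: return the steps needed to produce `target`.

    Steps are listed dependencies first. Assumes the dependency graph
    has no cycles.
    """
    # Map every output file to the step that produces it.
    producers = {out: name for name, s in steps.items() for out in s["outputs"]}
    plan = []

    def resolve(filename):
        if filename in existing_files:
            return  # already available, nothing to run
        if filename not in producers:
            raise ValueError(f"no step produces {filename!r}")
        name = producers[filename]
        if name in plan:
            return  # step already scheduled
        for dep in steps[name]["inputs"]:
            resolve(dep)
        plan.append(name)

    resolve(target)
    return plan


if __name__ == "__main__":
    print(process_oriented(["align", "sort", "call", "report"]))
    print(output_oriented("variants.vcf", {"reads.fastq"}))
```

Asking for variants.vcf schedules align, sort, and call but not report, because nothing on the path to the target needs qc.html. If sorted.bam already exists, only call is scheduled.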

Built-in support

Workflow | SoS | Nextflow | Snakemake | Bpipe | CWL | Galaxy
Docker | Yes | Yes | Yes | No | Yes | Yes
Support for Docker. | A docker_image option can execute scripts inside specified Docker images. | Nextflow supports Docker containers. You can run all scripts in the specified Docker image, or specify a Docker image for each step. | Snakemake supports the use of rule-level containers. | Bpipe does not have built-in support for containers. | The CWL specification supports Docker. | Galaxy steps can execute the docker run command with docker-flavored images.
Singularity | Yes | Yes | Yes | No | Yes | No
Support for Singularity. | SoS supports Singularity with action options container and engine; see the SoS Singularity Guide for details. | Nextflow supports Singularity containers. It works similarly to Docker but with options such as singularity.enabled=true. | Snakemake supports the use of rule-level containers. | Bpipe does not have built-in support for containers. | Not mentioned in the CWL specification, but cwltool supports it. | Galaxy supports Singularity containers.
PBS/Torque/LSF/SLURM | Needs template | Yes | Direct or via template | Yes | Yes | Yes
Ability to execute workflows on a PBS-style computer cluster. | SoS interacts with clusters through pre-configured templates and commands. It has been tested to work on Torque, LSF, SLURM, and PBS. | Nextflow supports Open Grid, Univa Grid, LSF, SLURM, PBS Works, and Torque. | Snakemake can interact with clusters through templates, or directly if the cluster supports DRMAA. | Bpipe provides built-in support for some resource manager systems, and a template-based system (adapter script) to support implementing resource managers. | cwltool and other implementations support clusters. | Galaxy can be deployed on clusters with steps executed on computing nodes.
HTCondor | Require template (?) | Yes | Require template (?) | Require template | No (?) | Yes
Ability to use HTCondor to execute workflows on large collections of distributively owned computing resources. | Unknown, because we have not had a chance to configure SoS to run on an HTCondor system. | Nextflow supports HTCondor. | There is no built-in support for HTCondor, and we could not find existing Snakemake HTCondor job templates either. | There is no built-in support for HTCondor; however, there seem to be third-party adapter scripts for the HTCondor job scheduler. | There seems to be no built-in support for HTCondor. | Galaxy supports HTCondor.
Distributed Task Queue | Yes (RQ) | No | No | No | No | No
Ability to send tasks to distributed task queues such as RQ and Celery. | SoS supports RQ; Celery support is likely broken due to lack of maintenance. | Nextflow cannot submit tasks to external task queues. | Snakemake cannot submit tasks to external task queues. | Bpipe does not provide built-in support for external task queues. | cwltool does not support external task queues. | Galaxy does not support external task queues.
Distributed systems | No | Yes | Experimental | No | Implementation dependent | Yes
Ability to spawn the executions of pipeline tasks through a distributed cluster such as Apache Spark, Apache Ignite, Apache Mesos, and Kubernetes. | No | Nextflow supports distributed systems such as Apache Ignite and Kubernetes. | Snakemake 4.0 and later supports experimental execution in the cloud via Kubernetes. | No | No trace of support from cwltool, but other workflow engines might support it. | Galaxy could be deployed on top of Kubernetes.
Cloud Storage | No | Yes | Yes | No | No | Yes
Ability to make use of cloud storage (such as AWS). | Not currently. | Nextflow can access S3 storage. | Snakemake can access files on cloud storage. | No | No information could be found on support for cloud storage. This should again be implementation/engine specific. | Galaxy objects could be stored on a distributed store or Amazon S3 (cf. Galaxy Object Store).
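
Several entries in this table mention reaching a cluster "via template". As a rough illustration of the idea, and not the actual template format used by SoS, Snakemake, or Bpipe, the sketch below fills a job-script template with per-task values using Python's standard string.Template. The placeholder names, the default values, and the render_job_script helper are all assumptions made for this example; the #PBS lines follow common PBS/Torque directive syntax.

```python
from string import Template

# Hypothetical PBS/Torque job template; placeholder names are illustrative only.
PBS_TEMPLATE = Template("""\
#!/bin/bash
#PBS -N ${job_name}
#PBS -l walltime=${walltime}
#PBS -l nodes=1:ppn=${cores}
cd ${workdir}
${command}
""")


def render_job_script(job_name, command, walltime="01:00:00", cores=1, workdir="."):
    """Fill the template with task-specific values and return the script text."""
    return PBS_TEMPLATE.substitute(
        job_name=job_name,
        walltime=walltime,
        cores=cores,
        workdir=workdir,
        command=command,
    )


if __name__ == "__main__":
    # The rendered text would typically be written to a file and handed to
    # the scheduler's submit command; here it is only printed.
    print(render_job_script("align", "echo hello", cores=4))
```

Keeping the scheduler-specific lines in a template means that supporting another scheduler is mostly a matter of writing another template rather than changing the engine itself.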

Is SoS for you?

SoS is not for everyone. As a workflow system:

  • If you are looking for an industrial-grade workflow system for the handling of millions of large jobs, you should look for proven solutions such as Luigi.
  • If you are aiming at the creation of “portable” workflows that can be executed in various cluster and cloud environments, Nextflow can be the first to try. Snakemake also has a wide user base and is a close draw with Nextflow in many aspects. Bpipe is also popular but seems to be less popular than Nextflow and Snakemake.
  • If you are aiming at the creation of “general” workflows with no specific workflow engine in mind, CWL is currently the best bet as CWL workflows can be executed by multiple workflow engines in different environments.
  • If you are looking for a script-less, GUI-based workflow system without the need for writing scripts, the answer is no because SoS is script based. Galaxy can be a good choice, at least for bioinformatics applications.
  • If you are a Jupyter or JupyterLab user, the answer is most likely yes because SoS is embedded into SoS Notebook, which is by itself a polyglot notebook. You can enjoy all features of SoS Notebook and step into SoS only when needed.
  • If you would like to use a workflow system for daily exploratory data analysis and computational research, SoS should be most usable since it is designed for interactive data analysis and execution of tasks on remote systems.

© Bo Peng, Ph.D. / MD Anderson Cancer Center All rights reserved