
SoS workflow: a 20 minute overview

  • Difficulty level: easy
  • Time needed to learn: 20 minutes or less
  • Key points:
    • SoS Workflow is extended from Python 3.6+
    • SoS Workflow uses SoS Notebook as its IDE
    • SoS workflows can be in plain text format, or be embedded in SoS notebooks
    • SoS workflow makes it easy to execute scripts in many different ways and environments
    • SoS Workflow supports both process-oriented and outcome-oriented (makefile-style) workflows
    • SoS Workflow supports a light workflow + external task approach to workload distribution

The SoS suite of tools, as its full name "Script of Scripts" suggests, is designed for data analysis using scripts in multiple languages. SoS Workflow System is designed to be readable, non-intrusive, and suitable for daily data analysis. This tutorial demonstrates the major features of SoS and explains the pros and cons of this unique workflow system.

Simple Python 3.6+ based syntax

The SoS workflow system extends the syntax of Python 3.6+ so any Python code can be used in a SoS Workflow. If you have a Python script, you can execute it with the sos executor as the first step of a one-step workflow called default.

In [1]:
hello world
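
Any ordinary Python script qualifies as such a one-step workflow. As a minimal sketch (the function name greet is chosen here only for illustration), the following plain Python code would produce the output shown above:

```python
def greet() -> str:
    """Return the greeting printed by the script."""
    return "hello world"


# Ordinary Python; nothing SoS-specific is needed for a one-step workflow.
print(greet())
```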

SoS adds the following syntax to Python 3.6+

Syntax | Example | Usage
--- | --- | ---
Script format of function call | sh: | Calling a Python function with multi-line script as first parameter
 | echo "I am sh" |
Section header | [step_10] | Define workflow steps
parameter statement | parameter: cutoff=5 | Define command line argument
input statement | input: "a.txt" | Define input targets of steps
output statement | output: "a.txt" | Define output targets of steps
depends statement | depends: sos_step('A') | Define dependent targets of steps
task statement | task: walltime='24h' | Define external tasks

These seven additions, together with a number of Python functions and data types, are all that SoS adds to Python 3.6+. This makes SoS quite easy to understand and learn, at least if you are already familiar with basic Python syntax.

Integration with SoS Notebook

SoS Notebook is a powerful notebook system for interactive multi-language data analysis, and is the preferred IDE for SoS Workflow System.

For example, the following three code cells perform a multi-language data analysis: the first cell defines a few variables, the second cell runs a bash script to convert an Excel file to CSV format, and the last cell uses R to read the CSV file and generate a plot. Three different kernels are used: SoS (based on Python 3.6+), bash_kernel, and IRkernel. A %expand magic passes filenames from the SoS kernel to the other kernels.

In [2]:
In [3]:
In [4]:
pdf: 2
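
To illustrate how filenames defined in one cell can end up in scripts for other languages, here is a rough sketch in plain Python. It uses str.format as a stand-in for the %expand magic, so it only approximates the idea. The names DEG.csv and output.pdf appear later in this tutorial; DEG.xlsx and the helper expand are assumptions made for this example.

```python
# Variables as they might be defined in the first (SoS/Python) cell.
# DEG.xlsx is a hypothetical name for the Excel input.
excel_file = "DEG.xlsx"
csv_file = "DEG.csv"
figure_file = "output.pdf"


def expand(template: str, **variables: str) -> str:
    """Fill {name} placeholders in a script; a stand-in for %expand."""
    return template.format(**variables)


# Script text that would be handed to the bash kernel.
bash_cell = expand(
    "xlsx2csv {excel_file} {csv_file}",
    excel_file=excel_file,
    csv_file=csv_file,
)

# Script text that would be handed to the R kernel (plot commands omitted).
r_cell = expand(
    "data <- read.csv('{csv_file}')\npdf('{figure_file}')",
    csv_file=csv_file,
    figure_file=figure_file,
)

print(bash_cell)
print(r_cell)
```

The point of the sketch is only that the other kernels receive ordinary script text in which the filenames have already been substituted.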

The SoS notebook is already a "workflow" in the sense that it presents a sequence of steps for a particular purpose. You can use "Run All Cells" to rerun the workflow, or even define some parameters and execute the notebook from the command line using sos-papermill.

However, if you would like to execute the steps in a more flexible way, you can convert them to a workflow as follows:

In [5]:
In [6]:
In [7]:

and execute it either from within the notebook

In [8]:
null device 
          1 

or from the command line using a command such as

sos run why_sos.ipynb plot

The workflow can be improved in many ways, but if you compare the notebook version with the workflow version, you will see how easy it is to convert a notebook into a formal SoS workflow. The script format of function calls certainly helps here because it allows verbatim inclusion of scripts in SoS workflows.

Flexible ways to execute scripts

Runtime signature for executed steps

Now, let us take one of the steps and try to run it with definitions of _input and _output.

In [9]:

Now, if you rerun the same step, you will notice that the step is ignored because of its saved signature: the step has exactly the same input, output, and processing script. This does not really matter for this small job but could save you hours in bioinformatic data analysis, where tools can take hours to complete.

In [10]:
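
The idea behind such a signature can be sketched in plain Python. This is not SoS's actual signature format, which this tutorial does not describe; it is a simplified illustration that hashes the input content, the output names, and the script text, and skips the step when the same combination has been seen before. The names step_signature and run_step are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def step_signature(inputs, outputs, script):
    """Digest of input file contents, output file names, and the script text."""
    digest = hashlib.sha256()
    for path in inputs:
        digest.update(Path(path).read_bytes())
    for path in outputs:
        digest.update(str(path).encode())
    digest.update(script.encode())
    return digest.hexdigest()


def run_step(inputs, outputs, script, action, signatures):
    """Run `action` unless the same step already produced all of its outputs."""
    sig = step_signature(inputs, outputs, script)
    if sig in signatures and all(Path(p).exists() for p in outputs):
        return "ignored"
    action()
    signatures.add(sig)
    return "executed"


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "in.txt"), Path(tmp, "out.txt")
    src.write_text("data")
    copy = lambda: dst.write_text(src.read_text())
    seen = set()
    first = run_step([src], [dst], "copy in.txt out.txt", copy, seen)
    second = run_step([src], [dst], "copy in.txt out.txt", copy, seen)
    # Editing the script changes the signature, so the step runs again.
    changed = run_step([src], [dst], "copy in.txt out.txt  # edited", copy, seen)

print(first, second, changed)  # executed ignored executed
```

Changing the input content or the output names would invalidate the signature in the same way as editing the script.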

Execute scripts in containers

A typical workflow would involve the execution of multiple commands and use multiple languages and libraries, and it can be quite difficult to install them. If you do not have xlsx2csv installed locally, you can execute the script in a container named pihizi/xlsx2csv. All you need is an option to specify the container to use.

In [11]:
HINT: Pulling docker image pihizi/xlsx2csv
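
For comparison, the sketch below builds the kind of docker run command one might otherwise assemble by hand. It is not what SoS executes internally. It assumes that the image provides sh and does not define an entrypoint that intercepts the command. The helper docker_command, the directory /data/project, and the file name DEG.xlsx are hypothetical.

```python
def docker_command(image, script, workdir):
    """Build a `docker run` argument list that runs `script` with sh in `image`.

    `workdir` must be an absolute host path; it is mounted at the same
    path inside the container and used as the working directory.
    """
    return [
        "docker", "run", "--rm",       # remove the container afterwards
        "-v", f"{workdir}:{workdir}",  # share the working directory
        "-w", workdir,                 # run the script from that directory
        image,
        "sh", "-c", script,
    ]


cmd = docker_command(
    "pihizi/xlsx2csv",
    "xlsx2csv DEG.xlsx DEG.csv",
    "/data/project",
)
print(" ".join(cmd[:-1]), repr(cmd[-1]))
```

With SoS, none of this bookkeeping appears in the workflow; the step only names the container.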

Execute scripts on remote hosts

Now suppose that xlsx2csv is a resource-demanding command that cannot be executed locally on your laptop, or that it is proprietary and only available on a remote server. In that case you can "pack" the script as a task and send it to a remote host for execution. For example, if you have a host bcb set up to be used with SoS, you can add a task statement to the step and use -q bcb to send the script to bcb for execution. SoS will automatically send input (if any) to the remote host and retrieve output (DEG.csv in this case) from it, even if the local and remote hosts do not share file systems and have different paths.

In [12]:
INFO: No matching tasks are identified. Use option -a to check all tasks.
INFO: fe9f6d84cbaf1371 started

Flexible workflow syntax

SoS uses a generalized step dependency system to specify relationships between steps, which accommodates both process-oriented and outcome-oriented workflows.

For example, the above workflow could be written in the following style, where the input and output of steps are specified and used in the script as variables _input and _output.

In [13]:

After removing the existing files DEG.csv and output.pdf, we can execute this workflow with option -t (target) to generate the output output.pdf. Both the convert and plot steps are executed because the intermediate file DEG.csv has to be generated first. This style is called a data-flow style, and more advanced versions of this workflow can accept patterns in a makefile style.

In [14]:
null device 
          1 
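
The way a requested target pulls in the steps that produce it can be sketched with a small resolver in plain Python. This is a simplified illustration of outcome-oriented planning, not SoS's implementation. The step table, the function plan, and the input name DEG.xlsx are assumptions made for this example.

```python
# Hypothetical step table: each step declares the files it reads and writes.
STEPS = {
    "convert": {"input": ["DEG.xlsx"], "output": ["DEG.csv"]},
    "plot": {"input": ["DEG.csv"], "output": ["output.pdf"]},
}


def plan(target, existing, steps=STEPS):
    """Return the names of the steps needed to produce `target`, in order."""
    if target in existing:
        return []  # nothing to do, the file is already there
    for name, step in steps.items():
        if target in step["output"]:
            order = []
            # First plan whatever is needed to create this step's inputs.
            for dep in step["input"]:
                for prerequisite in plan(dep, existing, steps):
                    if prerequisite not in order:
                        order.append(prerequisite)
            return order + [name]
    raise ValueError(f"no step generates {target!r}")


# Only the Excel file exists, so both steps are needed.
print(plan("output.pdf", {"DEG.xlsx"}))             # ['convert', 'plot']
# With DEG.csv already present, only plot has to run.
print(plan("output.pdf", {"DEG.xlsx", "DEG.csv"}))  # ['plot']
```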

SoS blends process-oriented and outcome-oriented workflows so well that you do not have to think about styles and can simply use SoS in a mixed workflow style. Without referring to another example, it is enough to show that we can execute the same plot workflow with magic %sosrun plot in a process-oriented style. However, because the input of this step does not exist, SoS looks for steps that generate this file and executes step convert before plot.

In [15]:
null device 
          1 

Note that the remote host can be a single server or a task queue, and with proper configuration SoS will be able to submit tasks to cluster systems and wait for their completion. Through the use of external tasks, SoS encourages you to include all analytical steps in a workflow, executing most of them locally while sending resource-intensive parts to remote systems and clusters.