- Difficulty level: easy
- Time needed to learn: 20 minutes or less
- Key points:
  - SoS Workflow extends the syntax of Python 3.6+
  - SoS Workflow uses SoS Notebook as its IDE
  - SoS workflows can be in plain text format, or be embedded in SoS notebooks
  - SoS Workflow makes it easy to execute scripts in many different ways and environments
  - SoS Workflow supports both process-oriented and outcome-oriented (makefile-style) workflows
  - SoS Workflow supports a "light workflow + external tasks" approach to workload distribution
The SoS suite of tools, as its full name "Script of Scripts" suggests, is designed for data analysis using scripts in multiple languages. The SoS Workflow System is designed to be readable, non-intrusive, and suitable for daily data analysis. This tutorial demonstrates the major features of SoS and explains the pros and cons of this unique workflow system.
The SoS workflow system extends the syntax of Python 3.6+, so any Python code can be used in a SoS workflow. If you have a Python script, you can execute it with the `sos` executor as the first step of a one-step workflow called `default`.
SoS adds the following syntax to Python 3.6+:
| Syntax | Example | Usage |
| --- | --- | --- |
| Script format of function call | `sh:` | Calling a Python function with a multi-line script as the first parameter |
| Section header | `[step_10]` | Define workflow steps |
| `parameter` statement | `parameter: cutoff=5` | Define command line arguments |
| `input` statement | `input: "a.txt"` | Define input targets of a step |
| `output` statement | `output: "a.txt"` | Define output targets of a step |
| `depends` statement | `depends: sos_step('A')` | Define dependent targets of a step |
| `task` statement | `task: walltime='24h'` | Define external tasks |
These 7 additional syntaxes and statements, together with a number of Python functions and data types, are all that SoS adds to Python 3.6+. This makes SoS quite easy to understand and learn, at least if you are already familiar with basic Python syntax.
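To make these concrete, here is a minimal one-step workflow that uses several of these constructs. This is only an illustrative sketch; the file names, the `lines` parameter, and the `head` command are made up for this example:

```
[default]
parameter: lines = 5
input: 'data.txt'
output: 'head.txt'
sh: expand=True
    head -n {lines} {_input} > {_output}
```

Here `expand=True` asks SoS to interpolate `{...}` expressions in the script, and `_input` and `_output` refer to the input and output targets of the step.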
SoS Notebook is a powerful notebook system for interactive multi-language data analysis, and is the preferred IDE for the SoS Workflow System.
For example, the following three code cells perform a multi-language data analysis where the first cell defines a few variables, the second cell runs a bash script to convert an Excel file to csv format, and the last cell uses R to read the csv file and generate a plot. Three different kernels, SoS (based on Python 3.6+), bash_kernel, and IRkernel, are used, and the `%expand` magic is used to pass filenames from the SoS kernel to other kernels.
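The original cells are not reproduced here, but a minimal sketch of what they could look like follows. The variable names and the R plotting code are assumptions made for illustration; only `DEG.csv`, `output.pdf`, and the use of `xlsx2csv` come from this tutorial.

```
# Cell 1 (SoS kernel): define a few variables
excel_file = 'DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'
```

```
%expand
# Cell 2 (Bash kernel): convert the Excel file to csv format
xlsx2csv {excel_file} > {csv_file}
```

```
%expand
# Cell 3 (R kernel): read the csv file and generate a plot
data <- read.csv('{csv_file}')
pdf('{figure_file}')
plot(data[[1]])   # hypothetical plot of the first column
dev.off()
```

The `%expand` magic replaces `{...}` expressions in a cell with the values of variables defined in the SoS kernel before the cell is passed to its own kernel.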
The SoS notebook is already a "workflow" in the sense that it presents a sequence of steps for a particular purpose. You can use "Run All Cells" to rerun the workflow, or even define some parameters and execute the notebook from the command line using sos-papermill.
However, if you would like to execute the steps in a more flexible way, you can convert them to a workflow as follows:
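A sketch of such a converted workflow is shown below; as above, `DEG.xlsx` and the R code are assumed for illustration:

```
[convert]
sh:
    xlsx2csv DEG.xlsx > DEG.csv

[plot]
R:
    data <- read.csv('DEG.csv')
    pdf('output.pdf')
    plot(data[[1]])   # hypothetical plot of the first column
    dev.off()
```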
and execute it either from within the notebook (for example with the `%sosrun` magic), or from the command line using commands such as

```
sos run why_sos.ipynb plot
```
The workflow can be improved in many ways, but if you compare the notebook version with the workflow version, you will see how easy it is to convert a notebook workflow to a formal SoS workflow. The script format of function call syntax certainly helps here, because it allows verbatim inclusion of scripts in SoS workflows.
Now, let us take one of the steps and try to run it with definitions of `_input` and `_output`.
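For example, the `convert` step could be run on its own in a notebook cell with the `%run` magic, with `_input` and `_output` defined by the `input` and `output` statements (again, `DEG.xlsx` is an assumed file name):

```
%run
[convert]
input: 'DEG.xlsx'
output: 'DEG.csv'
sh: expand=True
    xlsx2csv {_input} > {_output}
```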
Now, if you rerun the same step, you will notice that it is ignored due to its saved signature, because the step has exactly the same input, output, and processing script. This does not really matter for a small job like this, but it can save you hours in bioinformatic data analyses, where individual tools may take hours to complete.
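If you do want to re-execute such a step regardless of its signature, `sos run` accepts a signature-mode option; a sketch, reusing the workflow and step name from above:

```
sos run why_sos.ipynb convert -s force
```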
A typical workflow involves the execution of multiple commands in multiple languages and libraries, and it can be quite difficult to install all of them. If you do not have `xlsx2csv` installed locally, you can execute the script in a container named `pihizi/xlsx2csv`. All you need is an option to specify the container to use.
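A sketch of the step with such a `container` option (the input file name is assumed as before):

```
[convert]
input: 'DEG.xlsx'
output: 'DEG.csv'
sh: container='pihizi/xlsx2csv', expand=True
    xlsx2csv {_input} > {_output}
```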
Again, let us assume that `xlsx2csv` is a terribly resource-demanding command that cannot be executed locally on your laptop, or that it is proprietary and only available on a remote server. In that case you can "pack" the script as a task and send it to a remote host for execution. For example, if you have a host `bcb` set up to be used with SoS, you can add a `task` statement to the step and use `-q bcb` to send the script to `bcb` for execution. SoS will automatically send input files (if any) to the remote host and retrieve output files (`DEG.csv` in this case) from it, even if the local and remote hosts do not share file systems and have different paths.
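A sketch of the step with a `task` statement, followed by the command that sends it to host `bcb` (the `walltime` value and file names are illustrative):

```
[convert]
input: 'DEG.xlsx'
output: 'DEG.csv'
task: walltime='24h'
sh: expand=True
    xlsx2csv {_input} > {_output}
```

```
sos run why_sos.ipynb convert -q bcb
```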
SoS uses a generalized step dependency system to specify relationships between steps, which accommodates both process-oriented and outcome-oriented workflows.
For example, the above workflow could be written in the following style, where the `input` and `output` of each step are specified explicitly and made available in the scripts as variables `_input` and `_output`.
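A sketch of this style, with explicit `input` and `output` statements for both steps (the R code is hypothetical, as before):

```
[convert]
input: 'DEG.xlsx'
output: 'DEG.csv'
sh: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: 'DEG.csv'
output: 'output.pdf'
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data[[1]])   # hypothetical plot of the first column
    dev.off()
```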
After removing the existing files `DEG.csv` and `output.pdf`, we can execute this workflow with option `-t` (target) to generate the output `output.pdf`. Both the `convert` and `plot` steps are executed because of the need to generate the intermediate file `DEG.csv`. This style is called a data-flow style, and more advanced versions of this workflow can accept patterns in a makefile style.
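The corresponding commands would look like this, with `-t` naming the target to generate:

```
rm -f DEG.csv output.pdf
sos run why_sos.ipynb -t output.pdf
```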
SoS blends process-oriented and outcome-oriented workflows so well that you do not have to think about styles and can use SoS in a mixed style. Without referring to another example, it suffices to note that we can execute the same `plot` workflow with the magic `%sosrun plot` in a process-oriented style. However, because the input of this step does not exist, SoS looks for steps that generate this file and executes step `convert` before `plot`.
Note that the remote host can be a single server or a task queue, and with proper configuration SoS can submit tasks to cluster systems and wait for their completion. Through the use of external tasks, SoS encourages you to include all analytical steps in a workflow, executing most of them locally while sending resource-intensive parts to remote systems and clusters.