
Basic SoS workflows

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • A forward-style workflow consists of numerically numbered steps
    • Multiple workflows can be defined in a single SoS script or notebook
    • Optional input and output statements can be added to change how workflows are executed
    • The default input of a step is the output of its previous step

Simple workflows with numerically numbered steps

The workflows you have seen so far have numerically numbered steps. For example, the following example from the first tutorial has a single workflow plot with steps plot_10 and plot_20, which SoS executes in numeric order.

In [1]:
xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1 
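
The numeric ordering described above can be illustrated with a small Python sketch. This is a toy model rather than SoS's implementation: the helper order_steps is hypothetical, and it assumes step names of the form name_index, which it sorts by their integer index.

```python
def order_steps(step_names, workflow):
    """Return the steps of `workflow` sorted by numeric index (toy model)."""
    selected = []
    for name in step_names:
        # Split "plot_10" into the workflow name "plot" and the index "10".
        prefix, _, index = name.rpartition("_")
        if prefix == workflow and index.isdigit():
            selected.append((int(index), name))
    # Sort by integer index, so plot_20 runs after plot_10 and before plot_100.
    return [name for _, name in sorted(selected)]


steps = ["plot_20", "plot_100", "plot_10", "convert_10"]
ordered = order_steps(steps, "plot")
print(ordered)
```

Sorting on the integer index rather than on the name itself matters: a plain string sort would place plot_100 before plot_20.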

The workflow is executed by default with magic %run because only one workflow is defined in the script. You can also define multiple workflows and execute them by their names. For example, the following script defines two single-step workflows convert and plot. Because there is no default workflow, you will have to refer to them with their names:

%run convert

and

%run plot
In [2]:
xlsx2csv data/DEG.xlsx > DEG.csv

In [3]:
null device 
          1 

Default input of steps

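As shown in How to specify input and output files and process input files in groups, you can define input and output for each step. If a step of a numerically numbered workflow has no input statement, its input defaults to the output of the previous step.

Therefore, in the following workflow, the input statement of plot_20 can be omitted.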

In [4]:
xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1 
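
The default-input rule can be modeled in a few lines of Python. This is only a sketch under stated assumptions: resolve_inputs is a hypothetical helper, steps are plain dictionaries, and the output name DEG.pdf is made up for illustration.

```python
def resolve_inputs(steps):
    """Fill in missing inputs with the previous step's output (toy model)."""
    resolved = []
    previous_output = None
    for step in steps:
        # An explicit input wins; otherwise inherit the previous step's output.
        step_input = step.get("input", previous_output)
        resolved.append(
            {"name": step["name"], "input": step_input, "output": step.get("output")}
        )
        previous_output = step.get("output")
    return resolved


workflow = [
    {"name": "plot_10", "input": "data/DEG.xlsx", "output": "DEG.csv"},
    {"name": "plot_20", "output": "DEG.pdf"},  # no input; DEG.pdf is a hypothetical name
]
resolved = resolve_inputs(workflow)
for step in resolved:
    print(step["name"], "reads", step["input"])
```

In this model plot_20 ends up reading DEG.csv even though it never names that file, which is why its input statement is redundant.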

Basic data-flow based workflows

We have shown the same workflows in the plot_10, plot_20 style, in the convert and plot style, and with and without specification of input and output. What will happen if you define a workflow in separate steps with input and output statements?

Let us first remove the intermediate DEG.csv,

In [5]:

and execute the plot step of the following workflow

In [6]:
xlsx2csv data/DEG.xlsx > DEG.csv

null device 
          1 

As you can see, although only the step plot is requested, SoS executes both the convert and plot steps because the required input file csv_file (DEG.csv) does not exist. In this case, SoS looks for a step that produces DEG.csv and executes it to generate DEG.csv before plot is executed.
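
The lookup described above resembles make-style dependency resolution. The following Python sketch is a toy model and not how SoS is implemented: the plan function, the dictionary layout, and the output name DEG.pdf are assumptions for illustration, and cyclic dependencies are not handled.

```python
def plan(target_step, steps, existing_files):
    """Return the order in which steps would run to execute `target_step`."""
    order = []

    def visit(name):
        for needed in steps[name]["input"]:
            if needed in existing_files:
                continue  # the file is already there, nothing to do
            # Look for a step whose output provides the missing file.
            producers = [s for s, spec in steps.items() if needed in spec["output"]]
            if not producers:
                raise FileNotFoundError(f"no step produces {needed}")
            if producers[0] not in order:
                visit(producers[0])
        order.append(name)

    visit(target_step)
    return order


steps = {
    "convert": {"input": ["data/DEG.xlsx"], "output": ["DEG.csv"]},
    "plot": {"input": ["DEG.csv"], "output": ["DEG.pdf"]},  # DEG.pdf is hypothetical
}
missing_csv = plan("plot", steps, existing_files={"data/DEG.xlsx"})
with_csv = plan("plot", steps, existing_files={"data/DEG.xlsx", "DEG.csv"})
print(missing_csv)
print(with_csv)
```

When DEG.csv is missing the plan contains convert followed by plot; once the file exists, only plot is scheduled.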

Passing output of steps with substeps *

Things get a little more complicated when a step has multiple substeps. As you may recall from How to specify input and output files and process input files in groups, multiple substeps can be defined by the input option group_by, each with its own _input and _output. When the output of such a step is inherited by another step, each _output becomes the _input of one substep of the inheriting step.

For example, after running fastqc on the input fastq files, we would like to process the generated HTML file and check if the qualities are ok. We use the beautifulsoup Python module and find all the <h2> headers. Without going into the details of the use of beautifulsoup to parse HTML files, you should notice that

  • No input is defined for step 20 so it takes the output of step 10 as its input.
  • The output of step 10 contains two groups, data/S20_R1_fastqc.html and data/S20_R2_fastqc.html, which become the inputs of the two substeps of step 20.
  • The inputs of step 20 are processed one by one.
In [7]:
S20_R1_fastqc Basic Statistics: [OK]
S20_R1_fastqc Per base sequence quality: [OK]
S20_R1_fastqc Per tile sequence quality: [OK]
S20_R1_fastqc Per sequence quality scores: [OK]
S20_R1_fastqc Per base sequence content: [FAIL]
S20_R1_fastqc Per sequence GC content: [FAIL]
S20_R1_fastqc Per base N content: [OK]
S20_R1_fastqc Sequence Length Distribution: [WARN]
S20_R1_fastqc Sequence Duplication Levels: [OK]
S20_R1_fastqc Overrepresented sequences: [FAIL]
S20_R1_fastqc Adapter Content: [OK]
S20_R2_fastqc Basic Statistics: [OK]
S20_R2_fastqc Per base sequence quality: [OK]
S20_R2_fastqc Per tile sequence quality: [OK]
S20_R2_fastqc Per sequence quality scores: [OK]
S20_R2_fastqc Per base sequence content: [FAIL]
S20_R2_fastqc Per sequence GC content: [FAIL]
S20_R2_fastqc Per base N content: [OK]
S20_R2_fastqc Sequence Length Distribution: [WARN]
S20_R2_fastqc Sequence Duplication Levels: [OK]
S20_R2_fastqc Overrepresented sequences: [FAIL]
S20_R2_fastqc Adapter Content: [OK]
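
The way output groups of one step become the input groups of the next can be sketched in Python. This is a toy model only: run_step, fastqc, and check are hypothetical stand-ins (no tool is actually run), and the input names data/S20_R1.fastq and data/S20_R2.fastq are assumed from the report names above.

```python
def run_step(input_groups, action):
    """Run `action` once per input group and collect one output group each."""
    return [action(group) for group in input_groups]


def fastqc(group):
    # Stand-in for running fastqc: each fastq file maps to one HTML report.
    return [path.replace(".fastq", "_fastqc.html") for path in group]


def check(group):
    # Stand-in for parsing the report; it only records which files it saw.
    return [f"checked {path}" for path in group]


# Step 10 has two substeps, one per (assumed) fastq file.
step_10_input = [["data/S20_R1.fastq"], ["data/S20_R2.fastq"]]
step_10_output = run_step(step_10_input, fastqc)

# Step 20 defines no input, so it inherits the groups of step 10 unchanged.
step_20_output = run_step(step_10_output, check)
print(step_10_output)
print(step_20_output)
```

Because the grouping is preserved, step 20 runs twice, once for each report.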

If you would like to re-group the default input, you can redefine the input explicitly, or apply option group_by to the default input:

In [8]:
[##] 2 steps processed (1 job completed, 2 jobs ignored)
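
Re-grouping can be pictured as flattening the inherited groups and splitting them again. The Python sketch below is a toy model with a hypothetical regroup helper; it covers only the value 'all' and positive integers, whereas the group_by option of SoS accepts more forms than are modeled here.

```python
def regroup(groups, group_by):
    """Flatten `groups` and split the files again according to `group_by`."""
    files = [path for group in groups for path in group]
    if group_by == "all":
        # One group that holds every file.
        return [files]
    if isinstance(group_by, int) and group_by > 0:
        # Consecutive chunks of `group_by` files each.
        return [files[i:i + group_by] for i in range(0, len(files), group_by)]
    raise ValueError(f"unsupported group_by value in this sketch: {group_by!r}")


inherited = [["data/S20_R1_fastqc.html"], ["data/S20_R2_fastqc.html"]]
merged = regroup(inherited, "all")
single = regroup(inherited, 1)
print(merged)
print(single)
```

With 'all' the next step would run a single substep over both reports, while 1 reproduces the one-file-per-substep behavior of the default.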