
Error handling

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • Option -e specifies how sos handles runtime errors
    • -e default terminates the current step (and branch), but allows other branches to complete.
    • -e ignore ignores errors and allows the current and other branches to complete.
    • -e abort terminates the current and all running steps immediately.

Runtime errors happen from time to time. Depending on the nature of the errors, you can terminate the entire workflow abruptly, terminate it gently, or ignore all errors.

Error handling modes

Let us assume that an error happens in a substep of a step, and we need to decide:

  1. Should running steps or substeps be terminated immediately?
  2. Should the rest of the substeps of the failing step be executed if they have not been submitted?
  3. Should the unaffected branches of the DAG be executed while allowing the branch with the failed step to terminate?
  4. Should SoS try to execute the steps after the failed step?

The answers to these questions are controlled by the following error modes, specified with option -e of the command sos run (or of magics such as %run in SoS Notebook):

mode      running substeps   pending substeps   following steps   unaffected branches   exit status
default   allow complete     allow complete     canceled          allow complete        failed
ignore    allow complete     allow complete     allow complete    allow complete        success
abort     aborted            canceled           canceled          canceled              failed
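
The table can also be read as a lookup from error mode to outcome. The following is a minimal sketch in plain Python that only restates the table as data; `ERROR_MODES` and `describe` are hypothetical names introduced for illustration and are not part of SoS.

```python
# Illustrative only: this restates the table above as data.
# It is NOT how SoS implements error handling.
COLUMNS = (
    "running substeps",
    "pending substeps",
    "following steps",
    "unaffected branches",
    "exit status",
)

ERROR_MODES = {
    "default": ("allow complete", "allow complete", "canceled", "allow complete", "failed"),
    "ignore": ("allow complete", "allow complete", "allow complete", "allow complete", "success"),
    "abort": ("aborted", "canceled", "canceled", "canceled", "failed"),
}


def describe(mode):
    """Return the table row for an error mode as a {column: outcome} dict."""
    if mode not in ERROR_MODES:
        raise ValueError(f"unknown error mode: {mode!r}")
    return dict(zip(COLUMNS, ERROR_MODES[mode]))


for mode in ERROR_MODES:
    row = describe(mode)
    print(f"-e {mode}: following steps {row['following steps']}, exit status {row['exit status']}")
```

Only the ignore mode lets the steps after the failed step run and reports success; the other two modes report failure and differ in whether work that is already running or unaffected is allowed to finish.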

Let us use the following example workflow to demonstrate the different modes. In this workflow,

  1. Step 10 has three substeps that are executed in parallel for 2 seconds. The second substep will generate an error at the end of the step.
  2. Step 20 follows step 10 and will execute three substeps for 2 seconds.
  3. Step 30 has input: None so it will start at the same time as step 10. It is supposed to sleep 3 seconds.
  4. Step 40 will be executed after step 30 for 1 second.
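
To make the timing of the two branches concrete, here is a rough sketch in plain Python of when each step would start and end if no error occurred. It is not the workflow itself: `STEPS` and `schedule` are hypothetical names, the substeps of a step are assumed to run in parallel so that each step is reduced to a single duration, and scheduling overhead is ignored.

```python
# Hypothetical model of the example workflow; durations are in seconds.
# Each step maps to (duration, upstream step or None).
STEPS = {
    10: (2, None),  # three substeps, assumed to run in parallel for 2 seconds
    20: (2, 10),    # follows step 10
    30: (3, None),  # input: None, so it does not wait for step 10
    40: (1, 30),    # follows step 30
}


def schedule(steps):
    """Return {step: (start, end)} assuming every step succeeds."""
    times = {}
    # In this example every upstream step has a smaller number,
    # so visiting steps in sorted order resolves dependencies first.
    for name in sorted(steps):
        duration, upstream = steps[name]
        start = times[upstream][1] if upstream is not None else 0
        times[name] = (start, start + duration)
    return times


TIMELINE = schedule(STEPS)
for name, (start, end) in TIMELINE.items():
    print(f"step {name}: starts at {start}s, ends at {end}s")
```

Under these assumptions step 30 is still running when step 10 finishes at 2 seconds, which is why the error modes differ in what happens to steps 30 and 40.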
In [1]:
Cell content saved to test_error_mode.sos, use option -r to also execute the cell.

The execution of this workflow in the different error-handling modes is depicted as follows:

[Figure: execution of the example workflow in the different error-handling modes]

default error mode

In [2]:
Failed with 3 steps processed (4 jobs completed)
ERROR: [10]: [(id=a94d6f839e23ad66, index=1)]: Substep terminated
[default]: Exits with 1 pending step (20)

In the default error-handling mode, the three substeps of step 10 and step 30 are started at the same time. After substep 10.1 fails, step 10 is stopped, but step 30 is allowed to complete, followed by step 40, because that branch is independent of step 10. Step 20 is canceled due to the error from step 10.

ignore error mode

In [3]:
4 steps processed (6 jobs completed, 1 job ignored)

In the ignore error-handling mode, the three substeps of step 10 and step 30 are started at the same time. After substep 10.1 fails, step 10 produces a step_output with an invalid substep, and the workflow continues to execute. Substep 20.1 is not executed, but the other two substeps of step 20 are executed successfully. The other branch of the DAG (steps 30 and 40) is not affected by the error. The workflow is considered to have executed successfully despite the error.

abort error mode

In [4]:
Failed with 2 steps processed (3 jobs completed)
ERROR: [default]: [(id=a94d6f839e23ad66, index=1)]: Substep terminated
[default]: Exits with 2 pending steps (20, 40) and 1 running step (30)

In the abort error-handling mode, the three substeps of step 10 and step 30 are started at the same time. After substep 10.1 fails, SoS stops step 10 as well as step 30, which is still running. Steps 20 and 40 are canceled as well.