- Difficulty level: intermediate
- Time needed to learn: 10 minutes or less
- Key points:
  - Signatures are useful in checking the integrity of intermediate files
  - Option -T forces the rerun of upstream steps to ensure integrity of intermediate files
  - traced() targets are always checked or re-generated
Dependency check of workflow steps
A workflow step is ready to execute if all input and dependent targets exist.
Suppose you are working on a project that involves the execution of workflows repeatedly with different input files or parameters. It is possible to lose track of intermediate files and obtain incorrect results.
Let us assume that you created the following workflow, with step analyze accepting a parameter and writing its result to a file, and step summarize creating a report named out.md from the output of step analyze. A little trick here is that the action report accepts one or more input files through its input parameter and appends their contents to the end of the report.
Now, let us perform the analysis and generate a report:
Everything looks ok, so you would like to re-run the analysis using another parameter:
Do you see what the problem is here? When you run the step summarize of the workflow and its input out.txt already exists, SoS simply executes the step, so the analyze step is not executed again and the report is built from the out.txt produced with the previous parameter.
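The problem can be reproduced outside SoS with a few lines of plain Python. This is only a toy model, not how SoS works internally: the functions analyze and summarize below are hypothetical stand-ins for the two workflow steps, and the values 10 and 20 are arbitrary example parameters.

```python
import tempfile
from pathlib import Path


def analyze(workdir: Path, par: int) -> Path:
    """Stand-in for the analyze step: write a result that depends on par."""
    out = workdir / "out.txt"
    out.write_text(f"par={par}\n")
    return out


def summarize(workdir: Path) -> str:
    """Stand-in for the summarize step: build out.md from whatever out.txt holds."""
    report = "Summary\n" + (workdir / "out.txt").read_text()
    (workdir / "out.md").write_text(report)
    return report


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    analyze(workdir, par=10)
    first_report = summarize(workdir)
    # The user now wants par=20 but only reruns summarize. Because out.txt
    # already exists, nothing regenerates it and the old result is reused.
    stale_report = summarize(workdir)

print(stale_report)
```

The second report is identical to the first one, even though the user intended to analyze with a different parameter.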
A common solution to this problem is to introduce a special step that cleans up intermediate files. In a GNU Make system this involves introducing a clean target and using it to remove intermediate files, with commands like
make target par=10
make clean
make target par=20
We can do something similar and create a workflow as follows:
After the clean step, the next summarize step works ok.
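The effect of a clean step can be sketched in plain Python as follows. This is a toy model under stated assumptions, not SoS code: analyze, clean and summarize are hypothetical stand-ins for the workflow steps, and the executions list exists only to record how often the analysis really runs.

```python
import tempfile
from pathlib import Path

executions = []  # records every time the toy analyze step really runs


def analyze(workdir: Path, par: int) -> None:
    """Stand-in for the analyze step."""
    (workdir / "out.txt").write_text(f"par={par}\n")
    executions.append(par)


def clean(workdir: Path) -> None:
    """Remove the intermediate file so the next run cannot reuse it."""
    (workdir / "out.txt").unlink(missing_ok=True)


def summarize(workdir: Path, par: int) -> str:
    """Stand-in for the summarize step; runs analyze only if out.txt is missing."""
    out = workdir / "out.txt"
    if not out.exists():
        analyze(workdir, par)
    return "Summary\n" + out.read_text()


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    summarize(workdir, par=10)
    clean(workdir)
    report = summarize(workdir, par=20)  # picks up the new parameter
    clean(workdir)
    summarize(workdir, par=20)           # same parameter, yet analyze reruns

print(executions)
```

Cleaning guarantees a fresh out.txt, but it also forces the analysis to run again when nothing has changed, which is the cost the following approaches avoid.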
You can even execute the clean step before summarize using a compound workflow as follows:
An easier method, perhaps unique to SoS, is to force the workflow engine to trace the dependencies of existing files. More specifically, with option -T (trace dependency), SoS checks whether the input and dependent targets are the result of another step, and reruns those steps even if the targets already exist.
In addition to avoiding the trouble of a clean step, this method is also more efficient, because the analyze step will be ignored if its signature matches. This can be shown by running the same step twice as follows:
Because out.txt was generated with par=26, rerunning the workflow with the same parameter will not regenerate out.txt. This is clearly better than the clean approach, which forces the re-execution of the analyze step every time.
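The interplay of dependency tracing and signatures can be illustrated with a small Python sketch. It is a simplified model and makes several assumptions: the signature here is just the parameter plus a SHA-256 hash of the output file, the trace argument is a hypothetical stand-in for the -T option, and the real SoS signature mechanism records more than this.

```python
import hashlib
import json
import tempfile
from pathlib import Path

signatures = {}  # step name -> signature recorded after the last real run
log = []         # what the toy engine did, for inspection


def signature(par: int, output: Path) -> str:
    """Combine the parameter with a hash of the output file content."""
    digest = hashlib.sha256(output.read_bytes()).hexdigest()
    return json.dumps({"par": par, "output": digest}, sort_keys=True)


def analyze(workdir: Path, par: int) -> Path:
    """Stand-in for the analyze step; ignored when its signature matches."""
    out = workdir / "out.txt"
    if out.exists() and signatures.get("analyze") == signature(par, out):
        log.append(f"analyze ignored (par={par})")
        return out
    out.write_text(f"par={par}\n")
    signatures["analyze"] = signature(par, out)
    log.append(f"analyze executed (par={par})")
    return out


def summarize(workdir: Path, par: int, trace: bool) -> str:
    """Stand-in for summarize; with trace=True the upstream step is always checked."""
    out = workdir / "out.txt"
    if trace or not out.exists():
        analyze(workdir, par)
    return "Summary\n" + out.read_text()


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    summarize(workdir, par=10, trace=True)
    untraced = summarize(workdir, par=20, trace=False)      # reuses stale out.txt
    traced_report = summarize(workdir, par=20, trace=True)  # reruns analyze
    repeated = summarize(workdir, par=20, trace=True)       # signature matches

print("\n".join(log))
```

Without tracing, the report still reflects the old parameter. With tracing, the analysis runs once for the new parameter and is ignored on the repeated call.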
traced target
A traced target (e.g. traced('out.txt')) will always be verified or re-generated even if it already exists. The function converts its parameters to a sos_targets object and marks all its targets as traced. Therefore, it accepts all parameters of sos_targets(), and you can use it in formats such as
traced('a.txt')
traced('a.txt', 'b.txt')
traced(A='a.txt', B='b.txt')
...
The -T option is convenient but can be slow if your workflow handles a large number of files, because SoS needs to determine the dependent steps of all input and dependent files. In addition, users of your workflow may still produce erroneous output if they do not know about the -T option.
If some intermediate files are important and you always want to make sure that they are up to date, you can mark them as traced by wrapping them in the traced function.
Now, when you execute this workflow, the analyze step will always be executed (or ignored if its signature matches) to ensure the integrity of out.txt.
Note that the analyze step is ignored in the last case because its signature matches, so tracing the target costs you no repeated computation.