- Difficulty level: intermediate
- Time needed to learn: 10 minutes or less
- Key points:
- Signatures are useful in checking the integrity of intermediate files
- Option -T forces the rerun of upstream steps to ensure integrity of intermediate files
- traced() targets are always checked or re-generated
Dependency check of workflow steps
A workflow step is ready to execute if all input and dependent targets exist.
Suppose you are working on a project that involves the execution of workflows repeatedly with different input files or parameters. It is possible to lose track of intermediate files and obtain incorrect results.
Let us assume that you created the following workflow, in which step analyze accepts a parameter and writes its result to a file, and step summarize creates a report named out.md from the output of step analyze. A little trick here is that the action report accepts one or more input files through its input parameter and appends their contents to the end of the report.
Now, let us perform the analysis and generate a report:
Everything looks ok, so you would like to re-run the analysis using another parameter:
Do you see what the problem is here? When you run the summarize step of the workflow and its input out.txt already exists, SoS simply executes that step, so the analyze step is not executed again and the report is built from the stale out.txt.
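The problem can be reproduced with a toy model in plain Python. This is only a sketch of an existence-based readiness check, not SoS code: the functions is_ready, analyze, and summarize are hypothetical stand-ins for the workflow steps described above.

```python
import os
import tempfile


def is_ready(inputs):
    """A step is ready when every input target already exists on disk."""
    return all(os.path.exists(path) for path in inputs)


def analyze(par, out_path):
    # Stand-in for the real analysis: record the parameter that was used.
    with open(out_path, "w") as fh:
        fh.write(f"par={par}\n")


def summarize(in_path, par):
    # Existence-only check: the upstream step runs only if the input is missing.
    if not is_ready([in_path]):
        analyze(par, in_path)
    with open(in_path) as fh:
        return "Summary: " + fh.read().strip()


with tempfile.TemporaryDirectory() as workdir:
    out_txt = os.path.join(workdir, "out.txt")
    first = summarize(out_txt, par=10)
    second = summarize(out_txt, par=20)  # out.txt exists, so analyze is skipped

print(first)
print(second)
```

Both lines report the first parameter, because the second run never looks past the existing file.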
A common solution to this problem is to introduce a special step that cleans up intermediate files. In a GNU Make system this involves introducing a clean target and using it to remove intermediate files with commands like
make target --par 10
make clean
make target --par 20
We can do something similar and create a workflow as follows:
After the clean step, the next summarize step works ok.
You can even execute the clean step before summarize using a compound workflow as follows:
An easier method, perhaps unique to SoS, is to force the workflow engine to trace the dependency of existing files. More specifically, with option -T (trace dependency), SoS will check the input and dependent targets and see if they are the result of another step, and rerun the steps even if the targets already exist.
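The idea of tracing can be sketched in plain Python. This is a toy model under stated assumptions, not how SoS implements -T: the PRODUCERS table and the trace flag are hypothetical, and they only illustrate "find the step that produces an existing input and run it anyway".

```python
import os
import tempfile


def analyze(par, out_path):
    # Stand-in for the real analysis: record the parameter that was used.
    with open(out_path, "w") as fh:
        fh.write(f"par={par}\n")


# Hypothetical lookup table: which step produces which target.
PRODUCERS = {"out.txt": analyze}


def summarize(in_path, par, trace=False):
    producer = PRODUCERS.get(os.path.basename(in_path))
    missing = not os.path.exists(in_path)
    # With trace=True the producing step runs even though the file exists.
    if producer is not None and (missing or trace):
        producer(par, in_path)
    with open(in_path) as fh:
        return "Summary: " + fh.read().strip()


with tempfile.TemporaryDirectory() as workdir:
    out_txt = os.path.join(workdir, "out.txt")
    summarize(out_txt, par=10)
    untraced = summarize(out_txt, par=20)            # stale file is reused
    traced_run = summarize(out_txt, par=20, trace=True)  # upstream step reruns

print(untraced)
print(traced_run)
```

Without tracing the report still reflects the old parameter; with tracing the upstream step runs again and the report picks up the new one.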
In addition to avoiding the trouble of a clean step, this method is also more performant, because the analyze step will be ignored if its signature matches. This can be shown by running the same step twice as follows:
Because out.txt was generated with par=26, rerunning the workflow will not regenerate out.txt. This is clearly better than the clean approach, which always forces the re-execution of the analyze step.
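Why a matching signature saves work can be illustrated with a small Python sketch. The signature used here (the parameter value plus a SHA-256 digest of the output file) is an assumption made for illustration; this section does not describe what SoS actually stores in a signature, and the AnalyzeStep class is hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


class AnalyzeStep:
    """Toy step that remembers a signature of its last successful run."""

    def __init__(self, out_path):
        self.out_path = out_path
        self.signature = None  # (par, digest of output) from the last run

    def ensure(self, par):
        # Skip the work if the same parameter already produced this exact file.
        if (
            self.signature is not None
            and os.path.exists(self.out_path)
            and self.signature == (par, file_digest(self.out_path))
        ):
            return "skipped"
        with open(self.out_path, "w") as fh:
            fh.write(f"par={par}\n")
        self.signature = (par, file_digest(self.out_path))
        return "executed"


with tempfile.TemporaryDirectory() as workdir:
    step = AnalyzeStep(os.path.join(workdir, "out.txt"))
    history = [step.ensure(20), step.ensure(20), step.ensure(30)]

print(history)
```

The second call does no work because nothing changed, while the third call reruns because the parameter differs. A clean step, by contrast, would force the work every time.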
traced target
A traced target (e.g. traced('out.txt')) will always be verified or re-generated even if it already exists. The function converts its parameters to a sos_targets object and marks all of its targets as traced. It therefore accepts all parameters of sos_targets(), and you can use it in formats such as
traced('a.txt')
traced('a.txt', 'b.txt')
traced(A='a.txt', B='b.txt')
...
The -T option is convenient but can be slow if your workflow handles a large number of files, because SoS needs to determine the dependent steps of all input and dependent files. In addition, users of your workflow may still produce erroneous output if they do not know about the -T option.
If some intermediate files are important and you always want to make sure that they are up to date, you can mark them as traced by wrapping them in the traced function.
Now, when you execute this workflow, the analyze step will always be executed (or ignored if signature matches) to ensure the integrity of out.txt.
Note that the analyze step is ignored in the last case, so you do not lose any time to an unnecessary rerun of the analyze step.
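The difference between tracing everything and tracing selected targets can be sketched in plain Python. This is a toy model, not the SoS traced() function: the TracedPath marker class and the PRODUCERS table are hypothetical names introduced only for illustration.

```python
import os
import tempfile


class TracedPath(str):
    """Hypothetical marker: a path that must be verified on every run."""


def analyze(par, out_path):
    # Stand-in for the real analysis: record the parameter that was used.
    with open(out_path, "w") as fh:
        fh.write(f"par={par}\n")


# Hypothetical lookup table: which step produces which target.
PRODUCERS = {"out.txt": analyze}


def read_input(target, par):
    producer = PRODUCERS.get(os.path.basename(target))
    # Only marked targets, or missing ones, trigger the producing step.
    must_check = isinstance(target, TracedPath) or not os.path.exists(target)
    if producer is not None and must_check:
        producer(par, target)
    with open(target) as fh:
        return fh.read().strip()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "out.txt")
    first = read_input(path, par=10)
    plain = read_input(path, par=20)               # unmarked: stale file reused
    marked = read_input(TracedPath(path), par=20)  # marked: producer is rerun

print(first, plain, marked)
```

In this sketch the decision to trace travels with the target itself, so a caller cannot forget a command-line flag and only the marked files pay the cost of the check.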