Dependency tracing

Difficulty level: intermediate
Time needed to learn: 10 minutes or less
Key points:
- Signatures are useful in checking the integrity of intermediate files
- Option -T forces the rerun of upstream steps to ensure integrity of intermediate files
- traced() targets are always checked or re-generated

A problem with makefile-style workflows

Dependency check of workflow steps

A workflow step is ready to execute if all input and dependent targets exist.

Suppose you are working on a project that involves the execution of workflows repeatedly with different input files or parameters. It is possible to lose track of intermediate files and obtain incorrect results.

Let us assume that you created the following workflow, in which step analyze accepts a parameter and writes its result to a file, and step summarize creates a report named out.md from the output of step analyze. A little trick here is that the report action accepts one or more input files through its input parameter and appends their contents to the end of the report.

Now, let us perform the analysis and generate a report:

[##] 2 steps processed (2 jobs completed)

Everything looks ok, so you would like to re-run the analysis using another parameter:

[#] 1 step processed (1 job completed)

Do you see what the problem is here? When you run the step summarize of the workflow and its input out.txt already exists, SoS simply executes that step, so the analyze step is not executed again and the report is built from the stale out.txt.
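The problem can be reproduced with a toy model in plain Python. This is only a sketch of makefile-style resolution in general, not SoS code or SoS's implementation: the in-memory files dictionary stands in for the file system, and the functions analyze, summarize and run_summarize are hypothetical stand-ins for the steps described above.

```python
# Toy model of makefile-style resolution: an upstream step runs only
# when the file it produces is missing. Not SoS code; all names are
# hypothetical and "files" stands in for the file system.
files = {}


def analyze(par):
    """Write a result that depends on the parameter."""
    files["out.txt"] = f"result generated with par={par}"


def summarize():
    """Build the report from whatever out.txt currently contains."""
    files["out.md"] = f"Report: {files['out.txt']}"


def run_summarize(par):
    """Run summarize, running analyze first only if out.txt is missing."""
    executed = []
    if "out.txt" not in files:
        analyze(par)
        executed.append("analyze")
    summarize()
    executed.append("summarize")
    return executed


first = run_summarize(par=10)   # out.txt is missing, so both steps run
second = run_summarize(par=20)  # out.txt exists, so analyze is skipped
print(first, second)
print(files["out.md"])          # the report still reflects par=10
```

The second call passes a new parameter, but nothing in the existence check looks at the parameter, so the report is silently built from the old result.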

A clean up step

A common solution to this problem is to introduce a special step that cleans up intermediate files. In a GNU Make system this involves introducing a clean target and using it to remove intermediate files between runs, with commands like

make target --par 10
make clean
make target --par 20

We can do something similar and create a workflow as follows:

After the clean step, the next summarize run works as expected.

[#] 1 step processed (1 job completed)
[##] 2 steps processed (2 jobs completed)

You can even execute the clean step before summarize using a compound workflow as follows:

[###] 3 steps processed (3 jobs completed)

Dependency tracing with option `-T`

An easier method, perhaps unique to SoS, is to force the workflow engine to trace the dependency of existing files. More specifically, with option -T (trace dependency), SoS will check whether the input and dependent targets are the result of another step, and rerun those steps even if the targets already exist.

[##] 2 steps processed (2 jobs completed)

In addition to avoiding the trouble of a clean step, this method is also more efficient, because the analyze step is ignored if its signature matches. This can be shown by running the same step twice as follows:

Writing report with input out.txt

Because out.txt was generated with par=26, rerunning the workflow will not regenerate out.txt. This is clearly better than the clean approach, which always forces the re-execution of the analyze step.
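The idea of "revisit the upstream step, but skip it when its signature matches" can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not how SoS computes or stores signatures: here a signature is just a SHA-256 hash of the parameter and the output content, and the names signatures, signature and run_summarize_traced are hypothetical.

```python
import hashlib

# Toy model of dependency tracing with signatures. Not SoS code; the
# signature scheme and all names are assumptions for illustration.
files = {}
signatures = {}  # step name -> signature recorded at its last run


def signature(par, content):
    """Hash the parameter together with the output content."""
    return hashlib.sha256(f"{par}|{content}".encode()).hexdigest()


def analyze(par):
    files["out.txt"] = f"result generated with par={par}"
    signatures["analyze"] = signature(par, files["out.txt"])


def summarize():
    files["out.md"] = f"Report: {files['out.txt']}"


def run_summarize_traced(par):
    """Always revisit analyze; skip it only if its signature still matches."""
    log = []
    current = files.get("out.txt")
    if current is not None and signatures.get("analyze") == signature(par, current):
        log.append("analyze ignored")
    else:
        analyze(par)
        log.append("analyze executed")
    summarize()
    log.append("summarize executed")
    return log


log1 = run_summarize_traced(par=10)  # nothing exists yet: analyze runs
log2 = run_summarize_traced(par=20)  # parameter changed: analyze reruns
log3 = run_summarize_traced(par=20)  # nothing changed: analyze is ignored
print(log1, log2, log3, sep="\n")
```

Unlike a clean step, the third call does no upstream work at all, which is the saving described above.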

Forcing dependency tracing of selected files

`traced` target

A traced target (e.g. traced('out.txt')) will always be verified or re-generated even if it already exists. The function converts its parameters to a sos_targets object and marks all its targets as traced. Therefore, it accepts all parameters of sos_targets() and you can call it in forms such as

    traced('a.txt')
    traced('a.txt', 'b.txt')
    traced(A='a.txt', B='b.txt')
    ...

The -T option is convenient but can be slow if your workflow handles a large number of files, because SoS will need to determine the dependent steps of all input and dependent files. In addition, users of your workflow may still produce erroneous output if they do not know about the -T option.

If some intermediate files are important and you always want to make sure that they are up to date, you can mark them as traced by wrapping them in the traced function.
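The difference in cost between tracing everything and tracing only marked files can be illustrated with a small Python model. It is a sketch only: the Target class, its traced flag and the function targets_to_trace are hypothetical and do not reflect the real sos_targets or traced objects, and the file names sample1.csv and sample2.csv are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A file target; traced=True mimics wrapping it in traced()."""
    name: str
    traced: bool = False


def targets_to_trace(inputs, trace_all=False):
    """Return the names of targets whose producing step must be looked up.

    trace_all=True plays the role of option -T: every target is traced.
    Otherwise only targets explicitly marked as traced are checked.
    """
    return [t.name for t in inputs if trace_all or t.traced]


inputs = [
    Target("sample1.csv"),
    Target("sample2.csv"),
    Target("out.txt", traced=True),
]
selective = targets_to_trace(inputs)                   # marked targets only
everything = targets_to_trace(inputs, trace_all=True)  # behaves like -T
print(selective)
print(everything)
```

With many raw input files and only a few critical intermediate files, the selective list stays short, and the check happens whether or not the user remembers to pass an extra option.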

Now, when you execute this workflow, the analyze step will always be executed (or ignored if its signature matches) to ensure the integrity of out.txt.

[##] 2 steps processed (2 jobs completed)

[##] 2 steps processed (2 jobs completed)

[##] 2 steps processed (1 job completed, 1 job ignored)

Note that the analyze step is ignored in the last case, so you do not lose any productivity to an unnecessary rerun of the analyze step.