Remove large intermediate files without breaking signatures

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • Zapped files are removed from disk, but their signatures are kept

Removal of large intermediate files

SoS keeps track of all intermediate files and reruns steps only if any of the tracked files are removed or changed. However, it is often desirable to remove some of the large, non-essential intermediate files to reduce the disk space used by completed workflows, while still allowing the workflow to be re-executed without these files. SoS provides the command

sos remove files --zap

to zap the specified files, or, for example,

sos remove . --size +5G --zap

to zap all files larger than 5G. This command removes the specified files but keeps a special {file}.zapped file with essential information (e.g. MD5 signature and size). SoS considers a file to exist when its .zapped file is present, and regenerates the file only if the actual file is needed by a later step.
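
The idea can be illustrated with a small Python sketch. This is only a conceptual model, not SoS's implementation: the JSON layout of the placeholder and the helper names zap, considered_existing and zap_larger_than are assumptions made for illustration, and the real .zapped format may differ.

```python
import hashlib
import json
import os
import tempfile


def zap(path):
    """Replace `path` with a small `path.zapped` placeholder.

    The placeholder layout (JSON with md5 and size) is hypothetical.
    """
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    info = {"md5": digest, "size": os.path.getsize(path)}
    with open(path + ".zapped", "w") as f:
        json.dump(info, f)
    os.remove(path)
    return info


def considered_existing(path):
    """A file counts as present if it, or its placeholder, is on disk."""
    return os.path.exists(path) or os.path.exists(path + ".zapped")


def zap_larger_than(directory, min_bytes):
    """Zap every regular file under `directory` larger than `min_bytes`."""
    targets = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            # Skip placeholders left by an earlier zap.
            if name.endswith(".zapped"):
                continue
            if os.path.getsize(full) > min_bytes:
                targets.append(full)
    # Collect first, then modify the directory.
    for full in targets:
        zap(full)
    return sorted(targets)


with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "result.txt")
    small = os.path.join(tmp, "size.txt")
    with open(big, "wb") as f:
        f.write(b"\0" * 1024000)
    with open(small, "w") as f:
        f.write("1024000\n")

    zapped = zap_larger_than(tmp, 1000)
    listing = sorted(os.listdir(tmp))
    still_tracked = considered_existing(big)

print(listing)        # ['result.txt.zapped', 'size.txt']
print(still_tracked)  # True
```

The large file is gone from disk, yet the check that stands in for "does this file exist" still succeeds because the placeholder is present. A later step that only needs the file's signature could proceed, while one that needs the content would have to regenerate it.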

For example, let us execute a workflow with outputs temp/result.txt and temp/size.txt.

In [1]:
Cell content saved to test_remove.sos, use option -r to also execute the cell.
In [2]:
2000+0 records in
2000+0 records out
1024000 bytes transferred in 0.046135 secs (22195755 bytes/sec)

Now let us zap the intermediate file temp/result.txt:

In [3]:
INFO: 80 tracked files are identified.
Zap tracked file temp/result.txt
INFO: 1 file zapped
result.txt.zapped size.txt

As you can see, temp/result.txt is replaced with temp/result.txt.zapped. Now if you rerun the workflow

In [4]: