NOTE: If your filesystem contains non-ASCII filenames that are not encoded in UTF-8, the directory checksum might not match across platforms unless you use the Python 3 version of the script (run the source code with python3). There is no binary distribution of the Python 3 version because of packaging difficulties.
When we transfer or back up a large number of files, it is difficult to verify that the files have been copied correctly. Although it is possible to compare files and directories with the original source using dedicated file/directory comparison tools, or commands such as rsync -acv, the source data is not always available and it can be very slow to compare two large directories.
This script tries to address this problem by extending the standard md5sum program so that it can handle directories and produce partial checksums for large files. The MD5 checksum of a directory is generated by calculating the MD5 checksums of all its files and subdirectories, and then generating a checksum from a manifest of these values. Entries in the manifest are sorted so that the order in which files are processed does not affect the directory checksum. Because the manifest contains file size information, the choice to calculate the MD5 checksum of a large file from only 1G of its data (not necessarily the first 1G) should be safe.
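The idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code md5sumd uses: the manifest line layout, the names file_md5 and dir_md5, and the "first half plus last half" sampling of large files are all hypothetical, so the checksums it produces will not match those of md5sumd.

```python
import hashlib
import os

CHUNK = 1 << 20  # read files in 1 MiB pieces


def file_md5(path, limit=None):
    """Return (md5, size) of a file.

    If limit is set and the file is larger than limit bytes, only the first
    and last limit // 2 bytes are hashed (a hypothetical sampling scheme).
    """
    size = os.path.getsize(path)
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        if limit is None or size <= limit:
            for block in iter(lambda: fh.read(CHUNK), b""):
                digest.update(block)
        else:
            half = limit // 2
            for start in (0, size - half):
                fh.seek(start)
                remaining = half
                while remaining > 0:
                    block = fh.read(min(CHUNK, remaining))
                    if not block:
                        break
                    digest.update(block)
                    remaining -= len(block)
    return digest.hexdigest(), size


def dir_md5(path, limit=None):
    """Return (md5, total size) of a directory from a sorted manifest.

    Symbolic links and special files are not handled in this sketch.
    """
    lines = []
    total = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            checksum, size = dir_md5(full, limit)
            kind = "d"
        else:
            checksum, size = file_md5(full, limit)
            kind = "-"
        total += size
        # Hypothetical manifest line: checksum, type, size, name.
        lines.append(f"{checksum} {kind} {size} {name}")
    # Sorting makes the result independent of directory listing order.
    lines.sort()
    manifest = "\n".join(lines).encode("utf-8", "surrogateescape")
    return hashlib.md5(manifest).hexdigest(), total
```

Because each manifest line records the file size, a sampled checksum still changes when a large file is truncated or extended, which is the reason the partial checksum is considered safe above.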
If you are extremely impatient, you can skip the rest of this page and use the command
% md5sumd * -v | gzip > .manifest.md5.gz
to generate a fingerprint for all files under a directory and save it to the file .manifest.md5.gz, and use the command
% md5sumd -c .manifest.md5.gz
to check whether the content of the directory has been altered by file transfer, system failure, or unintentional changes.
% md5sumd -h
usage: md5sumd [-h] [--version] [-c [CHECKSUM]] [-v]
[FILE_OR_DIR [FILE_OR_DIR ...]]
A tool that calculates the MD5 checksum of files and directories, and use it
to check the integrity of these files and directories. It has a interface that
is similar to the md5sum command, with support for checksum of directories.
positional arguments:
FILE_OR_DIR Calculate MD5 signature of one or more files and
directories and print MD5 checksums to the standard
output.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-c [CHECKSUM], --check [CHECKSUM]
Check the content of one or more files and directories
using a file that contains the checksum of these files
and directories. Gippped checksum file is acceptable.
If a file is unspecified or is -, read from the
standard input.
-v, --verbose If specified, this program will output checksum for
all files when the checksum of a directory is
calculated. Such information will help the --check
command to figure out what files have been changed if
a directory checksum mismatch happens. This option
will also enable a progress bar for file scanning.
Let us calculate the MD5 of a directory:
% md5sumd vtools
b93f839744cd53fb87981c8254cc7511 vtools
If we copy the directory somewhere else, we can see that the signature is still the same:
% cp -r vtools ~/Temp
% md5sumd ~/Temp/vtools
b93f839744cd53fb87981c8254cc7511 ~/Temp/vtools/
If we change anything in that directory, the signature will be different:
% rm ~/Temp/vtools/*.pyc
% md5sumd ~/Temp/vtools
c71b7236b19feb1682f1c7039e5df8f2 ~/Temp/vtools/
Now let us save the md5 checksum to a file,
% md5sumd vtools > vtools.md5
% md5sumd -c vtools.md5
vtools: OK
When we transfer the directory to another place, we can still use this command to validate its content as long as the directory name has not changed. Now, if we change the directory and check again,
% rm -rf vtools/cache
% md5sumd --check vtools.md5
vtools: FAILED
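The check step shown above can be illustrated with a minimal Python sketch. It assumes a checksum file whose lines have the form "checksum name", and it handles regular files only; the names md5_of_file and check_lines are hypothetical, and the real md5sumd also supports directories.

```python
import hashlib


def md5_of_file(path):
    """Return the MD5 hex digest of a regular file."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def check_lines(lines, checksum_of=md5_of_file):
    """Recompute each recorded checksum and report OK or FAILED."""
    results = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' detail lines written in verbose mode.
        if not line or line.startswith("#"):
            continue
        expected, name = line.split(None, 1)
        try:
            status = "OK" if checksum_of(name) == expected else "FAILED"
        except OSError:
            # A missing or unreadable file counts as a failure here.
            status = "FAILED"
        results.append(f"{name}: {status}")
    return results
```

The checksum_of parameter is there so that a directory-aware checksum function could be substituted without changing the checking loop.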
It can be frustrating when a directory checksum mismatch happens but you have no idea what has been changed. A useful feature of the md5sumd command is that it can generate and output detailed file-level MD5 information and use it to figure out exactly what has changed in a directory.
% md5sumd vtools -v > vtools.md5
Scanning 34366 files: 100%[====================================] 2,095,623,045 48.6M/s in 00:00:430
As you can see, the --verbose option even enables a progress bar, which can be helpful for directories that contain a large number of files. The output of this command contains a lot more information, and it is interesting to see that there are 1,843,428 files with a total size of 28,615,195,835 under this directory.
% head -5 vtools.md5
2efce10e113804fc8a6b4e81ffd54f2e vtools
## MD5 type num_files num_dirs filesize total_num_files total_filesize name
#2efce10e113804fc8a6b4e81ffd54f2e d 34366 2760 2095623045 1843428 28615195835 vtools
#d75dc7768044f85895001913ae2a19b1 - 1 0 191368 1 191368 vtools/MANIFEST
#80e0735f4483d04b6cd28cff95b9b28c - 1 0 4260 1 4260 vtools/MANIFEST.in
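The verbose output shown above is easy to read programmatically. The following sketch parses it based only on the column header visible in the sample; the names Entry and parse_manifest are hypothetical, and the field separator is assumed to be whitespace.

```python
from collections import namedtuple

# Field names follow the '## MD5 type num_files ...' header line.
Entry = namedtuple(
    "Entry",
    "md5 type num_files num_dirs filesize total_num_files total_filesize name",
)


def parse_manifest(lines):
    """Split a verbose checksum file into summary and detail records."""
    summary = {}  # name -> md5 for the plain 'md5 name' lines
    details = {}  # name -> Entry for the '#...' detail lines
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or column header
        if line.startswith("#"):
            # Split into at most 8 fields so names may contain spaces.
            fields = line[1:].split(None, 7)
            if len(fields) != 8:
                continue
            md5, kind, *counts, name = fields
            details[name] = Entry(md5, kind, *map(int, counts), name)
        else:
            md5, name = line.split(None, 1)
            summary[name] = md5
    return summary, details
```

For the sample above, details["vtools"].total_num_files would be 1843428 and details["vtools/MANIFEST"].filesize would be 191368.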
Then, if we change the directory a little bit and check it with the --check option,
% rm -f vtools/*pyc
% rm vtools/source/*temp
% md5sumd -c vtools.md5
vtools/source: directory modified.
vtools/source/cgatools_wrap_py3.cpp_temp: file removed.
vtools/source/cgatools_py3.py_temp: file removed.
vtools/source/assoTests_wrap_py3.cpp_temp: file removed.
vtools/source/vt_sqlite3_py3.py_temp: file removed.
vtools/setup.pyc: file removed.
vtools/source/assoTests_py3.py_temp: file removed.
vtools: FAILED
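A report like the one above can be produced by comparing the recorded file-level entries with freshly computed ones. The sketch below shows the comparison only; compare_entries is a hypothetical name, the message wording merely imitates the output above, and the "added" message is an assumption that does not appear in the sample output.

```python
def compare_entries(recorded, current):
    """Compare two mappings of name -> (md5, kind).

    kind is 'd' for a directory and '-' for a file, as in the verbose
    checksum file. Returns a sorted list of human-readable messages.
    """
    label = {"d": "directory", "-": "file"}
    messages = []
    for name in sorted(recorded):
        md5, kind = recorded[name]
        what = label.get(kind, "file")
        if name not in current:
            messages.append(f"{name}: {what} removed.")
        elif current[name][0] != md5:
            messages.append(f"{name}: {what} modified.")
    # Entries that exist now but were not recorded (assumed wording).
    for name in sorted(set(current) - set(recorded)):
        what = label.get(current[name][1], "file")
        messages.append(f"{name}: {what} added.")
    return messages
```

Unlike the sample output, this sketch sorts its messages by name so that the result is deterministic.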
Although the main strength of md5sumd is its ability to calculate directory checksums, it works with files as well. For example, we can generate MD5 checksums for all files and directories under a directory using the command:
% cd vtools
% md5sumd -v * > vtools.md5
Scanning 65 files under annotation: 100%[============================] 194,919 17.0M/s in 00:00:000
Scanning 32123 files under boost_1_49_0: 100%[===================] 281,585,238 16.4M/s in 00:00:170
Scanning 700 files under build: 100%[============================] 180,236,646 97.0M/s in 00:00:010
Scanning 38 files under cgatools: 100%[===============================] 205,326 7.0M/s in 00:00:000
Scanning 10 files under dist: 100%[=============================] 103,912,176 265.9M/s in 00:00:000
Scanning 15 files under format: 100%[==================================] 33,932 5.0M/s in 00:00:000
Scanning 485 files under gsl: 100%[=================================] 1,758,005 7.2M/s in 00:00:000
Scanning 28 files under libplinkio: 100%[=============================] 138,531 9.9M/s in 00:00:000
Scanning 659 files under pyinstaller: 100%[========================] 9,993,226 33.3M/s in 00:00:000
Scanning 39 files under source: 100%[=============================] 2,637,686 156.2M/s in 00:00:000
Scanning 43 files under sqlite: 100%[=============================] 5,511,571 188.0M/s in 00:00:000
Scanning 121 files under test: 100%[==========================] 1,507,339,718 307.4M/s in 00:00:040
If anything has been changed, we can check what has changed using the command
% rm test/*DB*
% md5sumd --check vtools.md5
MANIFEST: OK
MANIFEST.in: OK
MANIFEST_local.txt: OK
README: OK
annotation: OK
boost_1_49_0: OK
build: OK
build_executable.py: OK
call_variants.py: OK
cgatools: OK
code_style.cfg: OK
dist: OK
format: OK
gsl: OK
libplinkio: OK
manage_resource.py: OK
pyinstaller: OK
release.py: OK
setup.py: OK
source: OK
sqlite: OK
test/dbSNP.DB: file removed.
test/gwasCatalog.DB: file removed.
test/evs.DB: file removed.
test/testNSFP-1.1_0.DB.gz: file removed.
test/testThousandGenomes.DB: file removed.
test/evs-hg19_20111107.DB.gz: file removed.
test/dbSNP.DB-journal: file removed.
test/evs-hg19_20111107.DB: file removed.
test/testNSFP.DB: file removed.
test: FAILED
vtools: OK
vtools.md5: FAILED
vtools.spec: OK
vtools_report: OK
vtools_report.log: OK
vtools_report.spec: OK
The md5sumd command can read directly from a gzipped checksum file. This is useful because the checksum file can get large when the --verbose option is used to list checksums of all files and directories under a large directory. For example, you can generate a checksum file using the command
% md5sumd vtools gsl -v | gzip > checksum.gz
and check it directly using the command
% md5sumd --check checksum.gz
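One common way to accept both plain and gzipped checksum files is to look at the first two bytes of the file, which are 0x1f 0x8b for gzip data. The sketch below shows that technique; whether md5sumd detects gzipped files this way or by file extension is not stated here, and open_checksum_file is a hypothetical name.

```python
import gzip


def open_checksum_file(path):
    """Open a plain or gzipped checksum file for reading as text."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Either way, the caller gets a text file object and can iterate over the checksum lines without caring about compression.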
The md5sumd --check command will read from standard input if no filename is specified (Python 2.7 or higher) or if the filename is -. For example,
% md5sumd vtools gsl | md5sumd -c -
vtools: OK
gsl: OK
Note that md5sumd -c - does not accept a gzipped stream directly, so if you have a gzipped manifest, you will need to pipe it through gzip -d before it is sent to the md5sumd -c - command.
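If the check is driven from a Python script rather than a shell pipeline, the decompression step can be done with the gzip module before the data is handed to md5sumd -c - on standard input. This is a minimal sketch: pipe_gzipped_manifest is a hypothetical helper, it assumes md5sumd is on the PATH, and it reads the whole manifest into memory.

```python
import gzip
import subprocess


def pipe_gzipped_manifest(path, command=("md5sumd", "-c", "-")):
    """Decompress a gzipped manifest and feed it to a command's stdin."""
    with gzip.open(path, "rb") as fh:
        data = fh.read()  # whole manifest in memory; fine for a sketch
    # capture_output requires Python 3.7 or higher.
    return subprocess.run(list(command), input=data, capture_output=True)
```

The returned object carries the command's exit code in returncode and its report in stdout, as bytes.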