- Difficulty level: easy
- Time needed to learn: 10 minutes or less
- Key points:
- Action download downloads listed URLs in parallel
Action download(URLs, dest_dir='.', dest_file=None, decompress=False, max_jobs=5) downloads files from the specified URLs, which can be a list of URLs or a string of URLs separated by tabs, spaces, or newlines.
- If dest_file is specified, only one URL is allowed and the URL can have any form.
- Otherwise all files will be downloaded to dest_dir. Filenames are determined from the URLs, so each URL must end with the filename to save.
- If decompress is True, .zip files, compressed or plain tar files (e.g. .tar.gz), and .gz files will be decompressed to the same directory as the downloaded file.
- max_jobs controls the maximum number of concurrent connections to each domain across instances of the download action. That is to say, if multiple steps from multiple workflows download files from the same website, at most max_jobs connections will be made. This option can therefore be used to throttle downloads to websites.
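To make the URL and filename rules above concrete, here is a minimal Python sketch. It is not the SoS implementation: the helper names split_urls and target_path and the example.com URLs are hypothetical. It only illustrates how a whitespace-separated string could be split into URLs and how the last portion of each URL could become the saved filename.

```python
import os
from urllib.parse import urlparse


def split_urls(urls):
    """Accept a list of URLs or a string of whitespace-separated URLs."""
    if isinstance(urls, str):
        # str.split() with no argument splits on tabs, spaces, and newlines
        return urls.split()
    return list(urls)


def target_path(url, dest_dir="."):
    """Derive the local path from the last portion of the URL path."""
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        # Without dest_file, a URL must end with a filename
        raise ValueError(f"URL does not end with a filename: {url}")
    return os.path.join(dest_dir, filename)


# Hypothetical URLs, separated by a tab
urls = split_urls("http://example.com/a.txt\thttp://example.com/data/b.tar.gz")
paths = [target_path(u, "downloads") for u in urls]
```

A URL such as http://example.com/ has no final filename, which is why the sketch rejects it; in that situation the action would need dest_file instead.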
For example,
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
download: dest_dir=GATK_RESOURCE_DIR, expand=True
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz.md5
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz.md5
downloads the specified files to GATK_RESOURCE_DIR. The .md5 files are automatically used to validate the content of the associated files.
Note that SoS automatically saves signatures of downloaded and decompressed files, so the files will not be re-downloaded if the action is called multiple times. You can, however, still specify the input and output of the step to use step signatures:
[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
GATK_RESOURCE_FILES = '''1000G_omni2.5.hg19.sites.vcf.gz
1000G_omni2.5.hg19.sites.vcf.gz.md5
1000G_omni2.5.hg19.sites.vcf.idx.gz
1000G_omni2.5.hg19.sites.vcf.idx.gz.md5'''.split()
input: []
output: [os.path.join(GATK_RESOURCE_DIR, x) for x in GATK_RESOURCE_FILES]
download([f'{GATK_URL}/{x}' for x in GATK_RESOURCE_FILES], dest_dir=GATK_RESOURCE_DIR)
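The .md5 validation mentioned earlier can be pictured with a short Python sketch. This is an illustration, not how SoS does it internally: the function names md5_of and matches_md5_file are hypothetical, and the sketch assumes the .md5 file stores the hex digest as its first whitespace-separated token (the layout written by the md5sum tool).

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(path, md5_path):
    """Compare a file's MD5 with the digest stored in its .md5 companion.

    Assumption: the digest is the first token in the .md5 file.
    """
    with open(md5_path) as fh:
        expected = fh.read().split()[0].lower()
    return md5_of(path) == expected
```

Reading in chunks keeps memory use flat for large resource files such as the VCF bundles in the example.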
Note that the download action uses up to 5 processes to download files. You can change this number by adjusting the system configuration sos_download_processes.
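The per-domain limit that max_jobs describes can be sketched in Python with one semaphore per host. This is only an illustration under stated assumptions: the names domain_slot and fetch_throttled are hypothetical, the fetch callable is supplied by the caller, and the limit here applies to threads within a single process, whereas the action is described as limiting connections across steps and workflows.

```python
import threading
from urllib.parse import urlparse

MAX_JOBS = 5  # mirrors the default max_jobs=5 in the action signature

_registry_lock = threading.Lock()
_slots = {}  # host name -> semaphore shared by all downloads to that host


def domain_slot(url, max_jobs=MAX_JOBS):
    """Return the semaphore for the URL's host, creating it on first use."""
    host = urlparse(url).hostname
    with _registry_lock:
        if host not in _slots:
            _slots[host] = threading.BoundedSemaphore(max_jobs)
        return _slots[host]


def fetch_throttled(url, fetch, max_jobs=MAX_JOBS):
    """Call fetch(url) while holding one of the host's connection slots."""
    with domain_slot(url, max_jobs):
        return fetch(url)
```

Because the semaphore is keyed by host, downloads from different websites do not block each other, while downloads from the same website queue once all of its slots are taken.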