
Action download

  • Difficulty level: easy
  • Time needed to learn: 10 minutes or less
  • Key points:
    • Action download downloads the listed URLs in parallel

Action download

Action download(URLs, dest_dir='.', dest_file=None, decompress=False, max_jobs=5) downloads files from the specified URLs, which can be a list of URLs or a string of URLs separated by tabs, spaces, or newlines.

  • If dest_file is specified, only one URL is allowed and the URL can have any form.
  • Otherwise all files will be downloaded to dest_dir. Filenames are determined from the URLs, so the last portion of each URL must be the filename to save.
  • If decompress is True, .zip files, compressed or plain tar files (e.g. .tar.gz), and .gz files will be decompressed to the same directory as the downloaded file.
  • max_jobs controls the maximum number of concurrent connections to each domain across instances of the download action. That is to say, if multiple steps from multiple workflows download files from the same website, at most max_jobs connections will be made. This option can therefore be used to throttle downloads to websites.
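
As a rough illustration of the first two rules, the sketch below shows one way a URLs argument could be normalized and how destination paths might be derived from the last portion of each URL. The helper names normalize_urls and dest_paths are hypothetical and are not part of SoS; the actual action may treat edge cases differently.

```python
import os
import posixpath
from urllib.parse import urlparse


def normalize_urls(urls):
    """Accept a list of URLs or a whitespace-separated string of URLs."""
    if isinstance(urls, str):
        # str.split() with no argument splits on tabs, spaces, and newlines
        return urls.split()
    return list(urls)


def dest_paths(urls, dest_dir='.'):
    """Map each URL to dest_dir/<last portion of the URL path>."""
    paths = []
    for url in normalize_urls(urls):
        filename = posixpath.basename(urlparse(url).path)
        if not filename:
            # Assumed behavior for this sketch: reject URLs without a filename
            raise ValueError(f'Cannot determine a filename from {url}')
        paths.append(os.path.join(dest_dir, filename))
    return paths
```

A URL that ends with a slash has no final filename portion, which is why such URLs would only be usable together with dest_file.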

For example,

[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'

download:   dest_dir=GATK_RESOURCE_DIR, expand=True
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.gz.md5
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz
    {GATK_URL}/1000G_omni2.5.hg19.sites.vcf.idx.gz.md5

downloads the specified files to GATK_RESOURCE_DIR. The .md5 files will be automatically used to validate the content of the associated files.

Note that SoS automatically saves signatures of downloaded and decompressed files, so the files will not be re-downloaded if the action is called multiple times. You can, however, still specify the input and output of the step to use step signatures:

[10]
GATK_RESOURCE_DIR = '/path/to/resource'
GATK_URL = 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/'
GATK_RESOURCE_FILES = '''1000G_omni2.5.hg19.sites.vcf.gz
    1000G_omni2.5.hg19.sites.vcf.gz.md5
    1000G_omni2.5.hg19.sites.vcf.idx.gz
    1000G_omni2.5.hg19.sites.vcf.idx.gz.md5'''.split()
input: []
output: [os.path.join(GATK_RESOURCE_DIR, x) for x in GATK_RESOURCE_FILES]
download([f'{GATK_URL}/{x}' for x in GATK_RESOURCE_FILES], dest_dir=GATK_RESOURCE_DIR)
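
The .md5 validation mentioned earlier can be pictured with a small standalone sketch. It is not the SoS implementation: the function names are hypothetical, and it assumes that the first whitespace-separated token of the .md5 file is the expected hex digest, as in files written by md5sum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def validate_with_md5(path, md5_path=None):
    """Return True if `path` matches the digest stored in its .md5 file.

    Assumption: the first token of the .md5 file is the expected digest.
    """
    md5_path = md5_path or path + '.md5'
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat for large resource files such as the VCF bundles in the example.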

Note that the download action uses up to 5 processes to download files. You can change this number by adjusting the system configuration option sos_download_processes.
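
The effect of a fixed upper bound on concurrent downloads can be sketched with a thread pool. This is only an illustration under stated assumptions: SoS uses processes and its own bookkeeping, while download_all and CountingFetcher are hypothetical names, and the fetcher does no network I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_jobs=5):
    """Call fetch(url) for every URL with at most max_jobs running at once."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # pool.map preserves the order of the input URLs
        return list(pool.map(fetch, urls))


class CountingFetcher:
    """Stand-in for a real downloader that records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, url):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # simulate transfer time
        with self._lock:
            self.active -= 1
        # Return the filename portion of the URL as a fake result
        return url.rsplit('/', 1)[-1]
```

Passing a CountingFetcher instance as fetch and inspecting its peak attribute afterwards shows that the number of simultaneous transfers never exceeds max_jobs.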