WGS Mapping to Variant Calls - Version 1.0

Implementing one of the samtools Workflows in SoS. Some steps were re-implemented using GATK4 instead of GATK3 / Pichard or samtools, for better performance.

Multithreading: since GATK4 the multi-thread option -nct is not available. One has to use Spark version to achieve it

In [1]:
Revision Author Date Message
910d34f Gao Wang 2019-01-02 Simplify input syntax removing default concurrent=True specification
2dbccfc Gao Wang 2018-11-06 Add system mem requirements for WGS call demo
876d2d9 Gao Wang 2018-10-30 Update documentation
924d9ab Gao Wang 2018-10-14 Add docker image tags to workflow examples
c70ac44 Gao Wang 2018-09-04 Add a check for index files
1ea76f2 Gao Wang 2018-09-04 Set zap to True by default
b98d99e Gao Wang 2018-09-03 Update docker option
3bf8f4d Gao Wang 2018-09-03 Add a step to select given list of variants
33fd12c Gao Wang 2018-09-03 Get toy example to work
8f3b701 Gao Wang 2018-09-03 Add links to toy reference data


This SoS workflow notebook contains three workflows:

  • get_ref_genome which downloads reference genome data, in preparation for genotype calling
  • get_known_variants which downloads known variants data, in preparation of genotype calling
  • default, a sequence of commands to perform multi-sample genotype calling for given list of samples

All workflow steps are numerically ordered to reflect the execution logic. This is the most straightforward SoS workflow style, the "process-oriented" style.

To view available workflows and options,

In [2]:
usage: sos run WGS_Call.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters


Global Workflow Options:
  --samples . (as path)
                        a file containing sample names
  --wd WGS_Call_SoS (as path)
                        work directory
  --ref-genome  path(f'{wd:a}/ref/reference.fa')

                        name of reference genome file
  --ref-variants  path(f'{wd:a}/ref/known_variants.vcf.gz')

                        name of known variant sites file
  --ref-indel  path(f'{wd:a}/ref/known_variants.vcf.gz')

                        name of known indel file
  --genome-url 'https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/reference.fa.gz'
                        download link for reference fasta file, in gz format
  --variants-url 'https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/known.vcf.gz'
                        download link for known variants vcf file, in gz format
  --[no-]zap (default to True)
                        whether or not to remove files from intermediate steps
  --bcftools-filter '%QUAL>10'
                        bcftools default quality filter
  --ncpu 3 (as int)
                        number of CPU threads to use per process
  --chr-prefix chr
                        can be empty, or chr, or Chr depending on input data-
  --n-chrom 38 (as int)
                        Human is 23, Bovine is 29, Canine is 38; we focus on
                        autosome and X.
  --[no-]variants-only (default to False)
                        Set to True to only call variant sites not "wildtype"
  --keep-coord . (as path)
                        Keep a list of specified variants only

  get_ref_genome_1:     Download reference genome
  get_ref_genome_2:     Index reference genome
  get_known_variants_1: Download known variants
  get_known_variants_2: Index known variants
  :                     Export workflow to HTML file
  :                     BWA mapping & samtools postprocessing
  :                     Sort BAM files
  :                     Extract relevant chromosomes for analysis
  :                     Re-order bam files
  :                     Indel re-alignment
  :                     Base recalibration
  :                     Mark duplicates
  :                     Multi-sample variant calling -- pileup
  :                     Multi-sample variant calling -- call
  :                     Quality summary plots for VCF file
  :                     Quality filtering
  :                     Only extract regions we are interested in

Input data

Input in fastq format, paired-ended:

Sample_75641_2.fq.gz Sample_75641_2.fq.gz

Reference data preparation

To prepare reference genome data, run:

sos run WGS_Call.ipynb get_ref_genome

To prepare reference known variants data, run:

sos run WGS_Call.ipynb get_known_variants

These commands will download reference data to ./WGS_Call_SoS/ref (work directory, can be configured via --wd option as shown in the workflow help message above).

If you already have reference genome data bundle and known variants data you can use --ref-genome, --ref-variants and --ref-indel to specify them. Otherwise you should edit get_ref_genome and get_known_variants steps below to download them. The default is pointed to a toy example data-set for demonstration

Run the workflow

The workflow has been tested on Linux and Mac. In brief, after installing SoS (see "Software Configuration" below), you provide a text file of sample list named eg k9-test/test_samples.manifest, with contents:


Under the same folder as this list file, you keep all the listed sample fastq files. Then you run

sos run WGS_Call.ipynb --samples k9-test/test_samples.manifest -j2

to perform variant calling. -j2 option means to run 2 processes in parallel. After it completes you'll find under folder ./WGS_Call_SoS (can be configured):

  1. WGS_Call.html of this document
  2. bam folder of intermediate and processed BAM files. Only BAM files from the last preprocessing step is kept by default, unless option --no-zap (which changes parameter: zap = ... in the workflow) is appended to the command above.
  3. vcf folder of genotype call. The final output is vcf/test_samples.filtered.vcf.gz -- the file name is derived after your input sample list /path/to/this/test_samples.manifest

You can also "dryrun" the workflow to see what is actually going on:

sos dryrun WGS_Call.ipynb --samples k9-test/test_samples.manifest

Or, run / dryrun a few steps, eg, the first 5 steps.

sos dryrun WGS_Call.ipynb default:1-5 --samples k9-test/test_samples.manifest

dryrun will print out all commands which you can collect to a file and run them separately (for debugging, for example).

Do we keep all loci even if they are not variants?

Yes we do (for obvious reasons)! This is why --variants_only is False by default. However it is a huge file to work on. We include an option --keep_coord that provides a list of genomic coordinates like this:

1   2815013
1   7421527
1   11653215
1   12326997
1   14892273
1   16663755

This case, only selected loci (variants of interest) are kept for further investigation.

Software configuration

To install SoS, make sure you have Python 3.6, then:

pip install -U sos sos-notebook sos-pbs
python -m sos_notebook.install

To open the notebook:

jupyter notebook WGS_Call.ipynb &> /dev/null &

If you do not already have sos_notebook on your computer, you will need to close this notebook after installation and open it again to see the sos_notebook kernel take effect in your Juptyer.

To better display notebook edits from git:

pip install nbdime
nbdime config-git --enable --global

Finally you need to install and configure docker:

  1. Run commands below:
    curl -fsSL get.docker.com -o get-docker.sh
    sudo sh get-docker.sh
    sudo usermod -aG docker $USER
  2. Log out and log back in (no need to reboot computer)

Global parameter settings

In [3]:

Reference preparation

Reference genome

Use primary assembly sequence. See this post for details.

In [4]:

create reference index and dict files

In [5]:

Known sites

NCBI dbSNP has reference SNP files for various species.

In [6]:

Sort and index VCF

In [7]:

The standard workflow for working with DNA sequence data consists of three major steps:

  • Mapping
  • Improvement
  • Variant Calling
In [8]:


We skip the bwa steps because our data has already been mapped.

Because BWA can sometimes leave unusual FLAG information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags.

In [9]:

To sort them from name order into coordinate order,

In [10]:

We extract relevant chromosomes and re-order to match ordering in reference,

In [11]:
In [12]:


In order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK3 Realigner. This is necessary only when used with other genotype callers. In GATK4 best practice pipeline this step can be skipped if HaplotypeCaller is used.

Note: option -nct does not apply to RealignerTargetCreator and IndelRealigner.

In [13]:

BQSR from the Broad’s GATK allows you to reduce the effects of analysis artefacts produced by your sequencing machines. It does this in two steps, the first analyses your data to detect covariates and the second compensates for those covariates by adjusting quality scores.

In [14]:

It is helpful at this point to compile all of the reads from each library together into one BAM, which can be done at the same time as marking PCR and optical duplicates. To identify duplicates we currently recommend the use of either the Picard or biobambam’s mark duplicates tool.

The following step only Marks duplicates instead of removing duplicates

In [15]:

It is possible to view summary of BAM files in SoS, using %preview feature, for example,

In [16]:
%preview WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam
> WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam (335.6 KiB):
  ID: 'Sample_75641'    SM: 'Sample_75641'    LB: 'library1'    PL: 'ILLUMINA'    ...
  ID: 'bwa'    PN: 'bwa'    VN: '0.7.15-r1140'    CL: bwa mem -t 3 -M -R @RG\tID:Sample_75641\tSM:Sample_75641\tLB...Sample_75641_2.fq.gz    
  ID: 'GATK IndelRealigner'    VN: '3.8-1-0-gf15c1c3ef'    CL: knownAlleles=[(RodBinding name=knownAlleles source=/home/gao...ileForDebugging=null    
  ID: 'GATK ApplyBQSR'    VN: ''    CL: ApplyBQSR  --output /home/gaow/Documents/GIT/wiki/sos-docs/s...se	PN:GATK ApplyBQSR    
  ID: 'MarkDuplicates'    VN: 'Version:'    CL: MarkDuplicates  --INPUT /home/gaow/Documents/GIT/wiki/sos-do...se	PN:MarkDuplicates    
  SN: 'chr36'    LN: 30001    M5: '68869954ee006c7821633a97b0a62430'    UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa    
  SN: 'chr37'    LN: 30001    M5: 'af868c903d4fa5e8484cff65c55a6368'    UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa    
  SN: 'chr38'    LN: 30001    M5: '357ef977363e01badd94e19051529270'    UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa    
  SN: 'chrX'    LN: 30001    M5: 'f7a40f22108352fa28a6f721315ad064'    UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa    

Variant Calling

To convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call.

In [17]:
In [18]:

Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:

In [19]:

To view one of the summary plot in SoS,

In [20]:
%preview WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf
> WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf (9.8 KiB):

Finally you will probably need to filter your data using commands such as:

In [21]:

Variant filtration is a subject worthy of an article in itself and the exact filters you will need to use will depend on the purpose of your study and quality and depth of the data used to call the variants.

In [22]:


  • The 1000 Genomes Project Consortium - An Integrated map of genetic variation from 1092 human genomes Nature 491, 56–65 (01 November 2012) doi:10.1038/nature11632
  • GATK Best Practices