WGS Mapping to Variant Calls - Version 1.0
Implementing one of the samtools
Workflows in SoS. Some steps were re-implemented using GATK4
instead of GATK3
/ Pichard
or samtools
, for better performance.
Multithreading: since GATK4 the multi-thread option -nct
is not available. One has to use Spark version to achieve it
Overview
This SoS workflow notebook contains three workflows:
get_ref_genome
which downloads reference genome data, in preparation for genotype callingget_known_variants
which downloads known variants data, in preparation of genotype callingdefault
, a sequence of commands to perform multi-sample genotype calling for given list of samples
All workflow steps are numerically ordered to reflect the execution logic. This is the most straightforward SoS workflow style, the "process-oriented" style.
To view available workflows and options,
Reference data preparation
To prepare reference genome data, run:
sos run WGS_Call.ipynb get_ref_genome
To prepare reference known variants data, run:
sos run WGS_Call.ipynb get_known_variants
These commands will download reference data to ./WGS_Call_SoS/ref
(work directory, can be configured via --wd
option as shown in the workflow help message above).
If you already have reference genome data bundle and known variants data you can use --ref-genome
,
--ref-variants
and --ref-indel
to specify them. Otherwise you should edit get_ref_genome
and get_known_variants
steps below to download them. The default is pointed to a toy example data-set for demonstration
Run the workflow
The workflow has been tested on Linux and Mac.
In brief, after installing SoS (see "Software Configuration" below),
you provide a text file of sample list named eg k9-test/test_samples.manifest
, with contents:
Sample_79162
Sample_75641
Under the same folder as this list file, you keep all the listed sample fastq
files. Then you run
sos run WGS_Call.ipynb --samples k9-test/test_samples.manifest -j2
to perform variant calling. -j2
option means to run 2 processes in parallel.
After it completes you'll find under folder ./WGS_Call_SoS
(can be configured):
WGS_Call.html
of this documentbam
folder of intermediate and processed BAM files. Only BAM files from the last preprocessing step is kept by default, unless option--no-zap
(which changesparameter: zap = ...
in the workflow) is appended to the command above.vcf
folder of genotype call. The final output isvcf/test_samples.filtered.vcf.gz
-- the file name is derived after your input sample list/path/to/this/test_samples.manifest
You can also "dryrun" the workflow to see what is actually going on:
sos dryrun WGS_Call.ipynb --samples k9-test/test_samples.manifest
Or, run / dryrun a few steps, eg, the first 5 steps.
sos dryrun WGS_Call.ipynb default:1-5 --samples k9-test/test_samples.manifest
dryrun
will print out all commands which you can collect to a file and run them separately (for debugging, for example).
Do we keep all loci even if they are not variants?
Yes we do (for obvious reasons)! This is why --variants_only
is False
by default. However it is a huge file to work on. We include an option --keep_coord
that provides a list of genomic coordinates like this:
1 2815013
1 7421527
1 11653215
1 12326997
1 14892273
1 16663755
This case, only selected loci (variants of interest) are kept for further investigation.
Software configuration
To install SoS, make sure you have Python 3.6, then:
pip install -U sos sos-notebook sos-pbs
python -m sos_notebook.install
To open the notebook:
jupyter notebook WGS_Call.ipynb &> /dev/null &
If you do not already have sos_notebook
on your computer, you will need to close this notebook after installation and open it again to see the sos_notebook
kernel take effect in your Juptyer.
To better display notebook edits from git
:
pip install nbdime
nbdime config-git --enable --global
Finally you need to install and configure docker
:
- Run commands below:
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
- Log out and log back in (no need to reboot computer)
create reference index and dict files
Known sites
NCBI dbSNP has reference SNP files for various species.
Sort and index VCF
The standard workflow for working with DNA sequence data consists of three major steps:
- Mapping
- Improvement
- Variant Calling
We skip the bwa
steps because our data has already been mapped.
Because BWA can sometimes leave unusual FLAG information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags.
To sort them from name order into coordinate order,
We extract relevant chromosomes and re-order to match ordering in reference,
Improvement
In order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK3 Realigner. This is necessary only when used with other genotype callers. In GATK4 best practice pipeline this step can be skipped if HaplotypeCaller is used.
Note: option -nct
does not apply to RealignerTargetCreator
and IndelRealigner
.
BQSR from the Broad’s GATK allows you to reduce the effects of analysis artefacts produced by your sequencing machines. It does this in two steps, the first analyses your data to detect covariates and the second compensates for those covariates by adjusting quality scores.
It is helpful at this point to compile all of the reads from each library together into one BAM, which can be done at the same time as marking PCR and optical duplicates. To identify duplicates we currently recommend the use of either the Picard or biobambam’s mark duplicates tool.
The following step only Marks duplicates instead of removing duplicates
It is possible to view summary of BAM files in SoS, using %preview
feature, for example,
Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:
To view one of the summary plot in SoS,
Finally you will probably need to filter your data using commands such as:
Variant filtration is a subject worthy of an article in itself and the exact filters you will need to use will depend on the purpose of your study and quality and depth of the data used to call the variants.
References
- The 1000 Genomes Project Consortium - An Integrated map of genetic variation from 1092 human genomes Nature 491, 56–65 (01 November 2012) doi:10.1038/nature11632
- GATK Best Practices