{ "cells": [ { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "# WGS Mapping to Variant Calls - Version 1.0\n", "\n", "Implementing one of the [`samtools` Workflows](http://www.htslib.org/workflow/#mapping_to_variant) in [SoS](https://github.com/vatlab/SOS). Some steps were re-implemented using `GATK4` instead of `GATK3` / `Pichard` or `samtools`, for better performance.\n", "\n", "**Multithreading: since GATK4 the multi-thread option `-nct` is not available. One has to use Spark version to achieve it**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "kernel": "SoS" }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
RevisionAuthorDateMessage
910d34fGao Wang2019-01-02Simplify input syntax removing default concurrent=True specification
2dbccfcGao Wang2018-11-06Add system mem requirements for WGS call demo
876d2d9Gao Wang2018-10-30Update documentation
924d9abGao Wang2018-10-14Add docker image tags to workflow examples
c70ac44Gao Wang2018-09-04Add a check for index files
1ea76f2Gao Wang2018-09-04Set zap to True by default
b98d99eGao Wang2018-09-03Update docker option
3bf8f4dGao Wang2018-09-03Add a step to select given list of variants
33fd12cGao Wang2018-09-03Get toy example to work
8f3b701Gao Wang2018-09-03Add links to toy reference data
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%revisions -s -n 10" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Overview\n", "\n", "This SoS workflow notebook contains three workflows:\n", "\n", "- `get_ref_genome` which downloads reference genome data, in preparation for genotype calling\n", "- `get_known_variants` which downloads known variants data, in preparation of genotype calling\n", "- `default`, a sequence of commands to perform multi-sample genotype calling for given list of samples\n", "\n", "All workflow steps are numerically ordered to reflect the execution logic. This is the most straightforward SoS workflow style, the \"process-oriented\" style. \n", "\n", "To view available workflows and options," ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "kernel": "SoS" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: sos run WGS_Call.ipynb [workflow_name | -t targets] [options] [workflow_options]\n", " workflow_name: Single or combined workflows defined in this script\n", " targets: One or more targets to generate\n", " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", " workflow_options: Double-hyphen workflow-specific parameters\n", "\n", "Workflows:\n", " get_ref_genome\n", " get_known_variants\n", " \n", "\n", "Global Workflow Options:\n", " --samples . (as path)\n", " a file containing sample names\n", " --wd WGS_Call_SoS (as path)\n", " work directory\n", " --ref-genome path(f'{wd:a}/ref/reference.fa')\n", "\n", " name of reference genome file\n", " --ref-variants path(f'{wd:a}/ref/known_variants.vcf.gz')\n", "\n", " name of known variant sites file\n", " --ref-indel path(f'{wd:a}/ref/known_variants.vcf.gz')\n", "\n", " name of known indel file\n", " --genome-url 'https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/reference.fa.gz'\n", " download link for reference fasta file, in gz format\n", " --variants-url 'https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/known.vcf.gz'\n", " download link for known variants vcf file, in gz format\n", " --[no-]zap (default to True)\n", " whether or not to remove files from intermediate steps\n", " --bcftools-filter '%QUAL>10'\n", " bcftools default quality filter\n", " --ncpu 3 (as int)\n", " number of CPU threads to use per process\n", " --chr-prefix chr\n", " can be empty, or chr, or Chr depending on input data-\n", " set.\n", " --n-chrom 38 (as int)\n", " Human is 23, Bovine is 29, Canine is 38; we focus on\n", " autosome and X.\n", " --[no-]variants-only (default to False)\n", " Set to True to only call variant sites not \"wildtype\"\n", " --keep-coord . (as path)\n", " Keep a list of specified variants only\n", "\n", "Sections\n", " get_ref_genome_1: Download reference genome\n", " get_ref_genome_2: Index reference genome\n", " get_known_variants_1: Download known variants\n", " get_known_variants_2: Index known variants\n", " : Export workflow to HTML file\n", " : BWA mapping & samtools postprocessing\n", " : Sort BAM files\n", " : Extract relevant chromosomes for analysis\n", " : Re-order bam files\n", " : Indel re-alignment\n", " : Base recalibration\n", " : Mark duplicates\n", " : Multi-sample variant calling -- pileup\n", " : Multi-sample variant calling -- call\n", " : Quality summary plots for VCF file\n", " : Quality filtering\n", " : Only extract regions we are interested in\n" ] } ], "source": [ "!sos run WGS_Call.ipynb -h" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Input data\n", "\n", "Input in `fastq` format, paired-ended:\n", "\n", "```\n", "Sample_75641_2.fq.gz Sample_75641_2.fq.gz\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Reference data preparation\n", "\n", "To prepare reference genome data, run:\n", "\n", "```\n", "sos run WGS_Call.ipynb get_ref_genome\n", "```\n", "\n", "To prepare reference known variants data, run:\n", "\n", "```\n", "sos run WGS_Call.ipynb get_known_variants\n", "```\n", "\n", "These commands will download reference data to `./WGS_Call_SoS/ref` \n", "(work directory, can be configured via `--wd` option as shown in the workflow help message above). \n", "\n", "If you already have reference genome data bundle and known variants data you can use `--ref-genome`, \n", "`--ref-variants` and `--ref-indel` to specify them. Otherwise you should edit `get_ref_genome` and `get_known_variants`\n", "steps below to download them. The default is pointed to a toy example data-set for demonstration" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Run the workflow\n", "\n", "The workflow has been tested on Linux and Mac. \n", "In brief, after installing [SoS](https://github.com/vatlab/SOS) (see \"Software Configuration\" below), \n", "you provide a text file of sample list named eg `k9-test/test_samples.manifest`, with contents:\n", "\n", "```\n", "Sample_79162\n", "Sample_75641\n", "```\n", "\n", "Under the same folder as this list file, you keep all the listed sample `fastq` files. Then you run\n", "\n", "```\n", "sos run WGS_Call.ipynb --samples k9-test/test_samples.manifest -j2\n", "```\n", "\n", "to perform variant calling. `-j2` option means to run 2 processes in parallel. \n", "After it completes you'll find under folder `./WGS_Call_SoS` (can be configured):\n", "\n", "1. `WGS_Call.html` of this document\n", "2. `bam` folder of intermediate and processed BAM files. Only BAM files from the last preprocessing step is kept by default, unless option `--no-zap` (which changes `parameter: zap = ...` in the workflow) is appended to the command above.\n", "3. `vcf` folder of genotype call. The final output is `vcf/test_samples.filtered.vcf.gz` -- the file name is derived after your input sample list `/path/to/this/test_samples.manifest`\n", "\n", "\n", "You can also \"dryrun\" the workflow to see what is actually going on:\n", "\n", "```\n", "sos dryrun WGS_Call.ipynb --samples k9-test/test_samples.manifest\n", "```\n", "\n", "Or, run / dryrun a few steps, eg, the first 5 steps.\n", "\n", "```\n", "sos dryrun WGS_Call.ipynb default:1-5 --samples k9-test/test_samples.manifest\n", "```\n", "\n", "`dryrun` will print out all commands which you can collect to a file and run them separately (for debugging, for example).\n", "\n", "### Do we keep all loci even if they are not variants?\n", "\n", "Yes we do (for obvious reasons)! This is why `--variants_only` is `False` by default. However it is a huge file to work on. We include an option `--keep_coord` that provides a list of genomic coordinates like this:\n", "\n", "```\n", "1\t2815013\n", "1\t7421527\n", "1\t11653215\n", "1\t12326997\n", "1\t14892273\n", "1\t16663755\n", "```\n", "\n", "This case, only selected loci (variants of interest) are kept for further investigation." ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Software configuration\n", "\n", "To install SoS, make sure you have Python 3.6, then:\n", "\n", "```\n", "pip install -U sos sos-notebook sos-pbs\n", "python -m sos_notebook.install\n", "```\n", "\n", "To open the notebook:\n", "\n", "```\n", "jupyter notebook WGS_Call.ipynb &> /dev/null &\n", "```\n", "\n", "If you do not already have `sos_notebook` on your computer, you will need to close this notebook after installation and open it again to see the `sos_notebook` kernel take effect in your Juptyer.\n", "\n", "To better display notebook edits from `git`:\n", "\n", "```\n", "pip install nbdime\n", "nbdime config-git --enable --global\n", "```\n", "\n", "Finally you need to install and configure `docker`:\n", "\n", "1. Run commands below:\n", "```\n", "curl -fsSL get.docker.com -o get-docker.sh\n", "sudo sh get-docker.sh\n", "sudo usermod -aG docker $USER\n", "```\n", "2. Log out and log back in (no need to reboot computer)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Global parameter settings" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[global]\n", "import sys, os\n", "# a file containing sample names\n", "parameter: samples = path() \n", "# work directory\n", "parameter: wd = path('./WGS_Call_SoS')\n", "# name of reference genome file\n", "parameter: ref_genome = path(f'{wd:a}/ref/reference.fa')\n", "# name of known variant sites file\n", "parameter: ref_variants = path(f'{wd:a}/ref/known_variants.vcf.gz')\n", "# name of known indel file\n", "parameter: ref_indel = path(f'{wd:a}/ref/known_variants.vcf.gz')\n", "# download link for reference fasta file, in gz format\n", "parameter: genome_url = \"https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/reference.fa.gz\"\n", "# download link for known variants vcf file, in gz format\n", "parameter: variants_url = \"https://github.com/vatlab/sos-docs/raw/master/src/examples/k9-test/known.vcf.gz\"\n", "# whether or not to remove files from intermediate steps\n", "parameter: zap = True\n", "# bcftools default quality filter\n", "parameter: bcftools_filter = '%QUAL>10'\n", "# number of CPU threads to use per process\n", "parameter: ncpu = 3\n", "# can be empty, or chr, or Chr depending on input data-set.\n", "parameter: chr_prefix = 'chr'\n", "# Human is 23, Bovine is 29, Canine is 38; \n", "# we focus on autosome and X.\n", "parameter: n_chrom = 38\n", "chroms = ' '.join([f'{chr_prefix}{x+1}' for x in range(n_chrom)] + [f'{chr_prefix}X'])\n", "# Set to True to only call variant sites not \"wildtype\"\n", "parameter: variants_only = False\n", "# Keep a list of specified variants only\n", "parameter: keep_coord = path()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Reference preparation\n", " \n", "### Reference genome\n", "\n", "Use primary assembly sequence. See [this post](https://bioinformatics.stackexchange.com/questions/540/what-ensembl-genome-version-should-i-use-for-alignments-e-g-toplevel-fa-vs-p?rq=1) for details." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Download reference genome\n", "[get_ref_genome_1]\n", "output: ref_genome\n", "download: dest_file = f'{_output}.gz', expand = True\n", " {genome_url}\n", "bash: container='gaow/debian-ngs:1.0', volumes=[f'{wd:a}:{wd:a}'], expand = True\n", " gunzip -f {ref_genome}.gz" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "create reference index and dict files" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Index reference genome\n", "[get_ref_genome_2]\n", "depends: system_resource(mem = '4G')\n", "output: f'{ref_genome:n}.dict', f'{ref_genome}.fai', f'{ref_genome}.bwt',\n", " f'{ref_genome}.pac', f'{ref_genome}.ann', f'{ref_genome}.amb', f'{ref_genome}.sa'\n", "stop_if(all([path(x).is_file() for x in _output]))\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{ref_genome}.err', stdout=f'{ref_genome}.out'\n", " rm -f {_output}\n", " samtools faidx {_input} && \\\n", " bwa index {_input} && \\\n", " gatk CreateSequenceDictionary -R {_input} -O {_output[0]}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### Known sites\n", "\n", "[NCBI dbSNP](ftp://ftp.ncbi.nih.gov/snp/organisms/archive) has reference SNP files for various species." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Download known variants\n", "[get_known_variants_1]\n", "output: f'{ref_variants:nn}.unsorted.vcf.gz'\n", "download: dest_file = f'{_output}', expand = True\n", " {variants_url}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Sort and index VCF" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Index known variants\n", "[get_known_variants_2]\n", "depends: system_resource(mem = '4G')\n", "output: ref_variants\n", "bash: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " gatk SortVcf -I {_input} -O {_output} \\\n", " --SEQUENCE_DICTIONARY {ref_genome:n}.dict \\\n", " --COMPRESSION_LEVEL 9 && \\\n", " gatk IndexFeatureFile -F {_output}\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "The standard workflow for working with DNA sequence data consists of three major steps:\n", "\n", "- Mapping\n", "- Improvement\n", "- Variant Calling" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Export workflow to HTML file\n", "[0]\n", "input: [item for item in sys.argv if item.endswith('.ipynb')], group_by = 1, pattern = '{fn}.ipynb'\n", "output: expand_pattern(f'{wd:a}/{path(_fn[0]):b}.html')\n", "bash: expand=True, stderr=False\n", " sos convert {_input} {_output}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Mapping" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "We skip the `bwa` steps because our data has already been mapped. \n", "\n", "Because BWA can sometimes leave unusual FLAG information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# BWA mapping & samtools postprocessing\n", "[1]\n", "depends: system_resource(mem = '4G')\n", "fail_if(not samples.is_file(), msg = f'Sample name file ``{samples}`` not found!')\n", "sample_files = list(set(get_output(f\"awk '{{print $1}}' {samples:e}\").strip().split('\\n')))\n", "fail_if(len(sample_files) == 0, msg = 'Need input sample files!')\n", "input: [(f'{samples:d}/{x}_1.fq.gz', f'{samples:d}/{x}_2.fq.gz') for x in sample_files], group_by = 2\n", "output: f'{wd:a}/{samples:bn}_bam/{_input[0]:bnn}.fixmate.bam'\n", "run: container='gaow/debian-ngs:1.0', volumes=[f'{samples:da}:{samples:da}', f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " bwa mem -t {ncpu} -M -R \"@RG\\\\tID:{os.path.basename(sample_files[_index].replace('-', '_'))}\\\\tSM:{os.path.basename(sample_files[_index].replace('-', '_'))}\\\\tLB:library1\\\\tPL:ILLUMINA\\\\tPU:{os.path.basename(sample_files[_index].replace('-', '_'))}.library1\" {ref_genome} {_input[0]} {_input[1]} | samtools sort -n - | samtools fixmate -O bam - {_output}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "To sort them from name order into coordinate order," ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Sort BAM files\n", "[2]\n", "# sort via GATK4 is way faster\n", "# https://software.broadinstitute.org/gatk/blog?id=9644\n", "output: f'{_input:n}.sorted.bam'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " gatk SortSam -I {_input} -O {_output} -SO coordinate\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "We extract relevant chromosomes and re-order to match ordering in reference," ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Extract relevant chromosomes for analysis\n", "[3]\n", "output: f'{_input:n}.cleaned.bam'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " samtools index {_input}\n", " samtools view -@ {ncpu} -b {_input} {chroms} > {_output}\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Re-order bam files\n", "[4]\n", "output: f'{_input:n}.reordered.bam'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " gatk ReorderSam -I {_input} -O {_output} -R {ref_genome} \\\n", " --CREATE_INDEX TRUE \\\n", " --ALLOW_INCOMPLETE_DICT_CONCORDANCE TRUE\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Improvement\n", "\n", "In order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK3 Realigner. This is necessary only when used with other genotype callers. In GATK4 best practice pipeline this step can be skipped if HaplotypeCaller is used.\n", "\n", "Note: option `-nct` does not apply to `RealignerTargetCreator` and `IndelRealigner`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Indel re-alignment\n", "[5]\n", "output: f'{_input:n}.realigned.bam'\n", "run: container='gaow/gatk3:2.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " java -jar /opt/GenomeAnalysisTK.jar -T RealignerTargetCreator \\\n", " -R {ref_genome} \\\n", " -I {_input} \\\n", " -o {_input:n}.intervals \\\n", " --known {ref_indel} && \\\n", " java -jar /opt/GenomeAnalysisTK.jar -T IndelRealigner \\\n", " -R {ref_genome} \\\n", " -I {_input} \\\n", " -targetIntervals {_input:n}.intervals \\\n", " -known {ref_indel} \\\n", " -o {_output}\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "BQSR from the Broad’s GATK allows you to reduce the effects of analysis artefacts produced by your sequencing machines. It does this in two steps, the first analyses your data to detect covariates and the second compensates for those covariates by adjusting quality scores." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Base recalibration\n", "[6]\n", "output: f'{_input:n}.recal.bam'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output}.err', stdout=f'{_output}.out'\n", " gatk BaseRecalibrator \\\n", " -R {ref_genome} \\\n", " --known-sites {ref_variants} \\\n", " -I {_input} -O {_input:n}.table && \\\n", " gatk ApplyBQSR -I {_input} -bqsr {_input:n}.table -O {_output} # --create-output-bam-index TRUE\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "It is helpful at this point to compile all of the reads from each library together into one BAM, which can be done at the same time as marking PCR and optical duplicates. To identify duplicates we currently recommend the use of either the Picard or biobambam’s mark duplicates tool.\n", "\n", "*The following step only Marks duplicates instead of removing duplicates*" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Mark duplicates\n", "[7]\n", "output: f'{_input:n}.dedup.bam'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " gatk MarkDuplicates \\\n", " --VALIDATION_STRINGENCY LENIENT \\\n", " --INPUT {_input} --OUTPUT {_output} \\\n", " --METRICS_FILE {_output:n}.metrics_file \\\n", " --CREATE_INDEX TRUE\n", "if zap:\n", " _input.zap()" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "It is possible to view summary of BAM files in SoS, using `%preview` feature, for example," ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "kernel": "SoS" }, "outputs": [ { "data": { "text/html": [ "
%preview WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
> WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam (335.6 KiB):
" ], "text/plain": [ "\n", "> WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam (335.6 KiB):" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "RG:\n", " ID: 'Sample_75641' SM: 'Sample_75641' LB: 'library1' PL: 'ILLUMINA' ...\n", "PG:\n", " ID: 'bwa' PN: 'bwa' VN: '0.7.15-r1140' CL: bwa mem -t 3 -M -R @RG\\tID:Sample_75641\\tSM:Sample_75641\\tLB...Sample_75641_2.fq.gz \n", " ID: 'GATK IndelRealigner' VN: '3.8-1-0-gf15c1c3ef' CL: knownAlleles=[(RodBinding name=knownAlleles source=/home/gao...ileForDebugging=null \n", " ID: 'GATK ApplyBQSR' VN: '4.0.3.0' CL: ApplyBQSR --output /home/gaow/Documents/GIT/wiki/sos-docs/s...se\tPN:GATK ApplyBQSR \n", " ID: 'MarkDuplicates' VN: 'Version:4.0.3.0' CL: MarkDuplicates --INPUT /home/gaow/Documents/GIT/wiki/sos-do...se\tPN:MarkDuplicates \n", "SQ:\n", " SN: 'chr36' LN: 30001 M5: '68869954ee006c7821633a97b0a62430' UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa \n", " SN: 'chr37' LN: 30001 M5: 'af868c903d4fa5e8484cff65c55a6368' UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa \n", " SN: 'chr38' LN: 30001 M5: '357ef977363e01badd94e19051529270' UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa \n", " SN: 'chrX' LN: 30001 M5: 'f7a40f22108352fa28a6f721315ad064' UR: file:/home/gaow/Documents/GIT/wiki/sos-docs/src/examples/WGS...SoS/ref/reference.fa \n" ] } ], "source": [ "%preview WGS_Call_SoS/test_samples_bam/Sample_75641_1.fixmate.sorted.cleaned.reordered.realigned.recal.dedup.bam" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Variant Calling\n", "To convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Multi-sample variant calling -- pileup\n", "[8]\n", "# the 2nd line can have many samples separated by space\n", "input: group_by = 'all'\n", "output: f'{wd:a}/{samples:bn}_vcf/{samples:bn}.bcf'\n", "run: container='gaow/debian-ngs:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " bcftools mpileup -Ob -f {ref_genome} --threads {ncpu} -o {_output} {_input}" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Multi-sample variant calling -- call\n", "[9]\n", "output: f'{_input:n}.vcf.gz'\n", "run: container='gaow/debian-ngs:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " bcftools call -{\"v\" if variants_only else \"\"}mO z --threads {ncpu} -o {_output} {_input} \\\n", " && tabix -p vcf {_output}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Additionally you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Quality summary plots for VCF file\n", "[10]\n", "output: f'{_input:n}.stats'\n", "run: container='gaow/debian-ngs-gatk4:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " bcftools stats -F {ref_genome} -s - {_input} > {_output}\n", " mkdir -p {_input:nn}_plots\n", " plot-vcfstats -P -p {_input:nn}_plots {_output}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "To view one of the summary plot in SoS," ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "kernel": "SoS" }, "outputs": [ { "data": { "text/html": [ "
%preview WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
> WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf (9.8 KiB):
" ], "text/plain": [ "\n", "> WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf (9.8 KiB):" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVoAAAE7CAMAAACrJ7m7AAAJJGlDQ1BpY2MAAHjalZVnUJNZF8fv8zzphUASQodQQ5EqJYCUEFoo0quoQOidUEVsiLgCK4qINEUQUUDBVSmyVkSxsCgoYkE3yCKgrBtXERWUF/Sd0Xnf2Q/7n7n3/OY/Z+4995wPFwCCOFgSvLQnJqULvJ3smIFBwUzwg8L4aSkcT0838I96Pwyg5XhvBfj3IkREpvGX4sLSyuWnCNIBgLKXWDMrPWWZDy8xPTz+K59dZsFSgUt8Y5mjv/Ho15xvLPqa4+vNXXoVCgAcKfoHDv+B/3vvslQ4gvTYqMhspk9yVHpWmCCSmbbcCR6Xy/QUJEfFJkT+UPC/Sv4HpUdmpy9HbnLKBkFsdEw68/8ONTIwNATfZ/HW62uPIUb//85nWd+95HoA2LMAIHu+e+GVAHTuAED68XdPbamvlHwAOu7wMwSZ3zzU8oYGBEABdCADFIEq0AS6wAiYAUtgCxyAC/AAviAIrAN8EAMSgQBkgVywDRSAIrAH7AdVoBY0gCbQCk6DTnAeXAHXwW1wFwyDJ0AIJsArIALvwTwEQViIDNEgGUgJUod0ICOIDVlDDpAb5A0FQaFQNJQEZUC50HaoCCqFqqA6qAn6BToHXYFuQoPQI2gMmob+hj7BCEyC6bACrAHrw2yYA7vCvvBaOBpOhXPgfHg3XAHXwyfgDvgKfBsehoXwK3gWAQgRYSDKiC7CRriIBxKMRCECZDNSiJQj9Ugr0o30IfcQITKDfERhUDQUE6WLskQ5o/xQfFQqajOqGFWFOo7qQPWi7qHGUCLUFzQZLY/WQVugeehAdDQ6C12ALkc3otvR19DD6An0ewwGw8CwMGYYZ0wQJg6zEVOMOYhpw1zGDGLGMbNYLFYGq4O1wnpgw7Dp2AJsJfYE9hJ2CDuB/YAj4pRwRjhHXDAuCZeHK8c14y7ihnCTuHm8OF4db4H3wEfgN+BL8A34bvwd/AR+niBBYBGsCL6EOMI2QgWhlXCNMEp4SyQSVYjmRC9iLHErsYJ4iniDOEb8SKKStElcUggpg7SbdIx0mfSI9JZMJmuQbcnB5HTybnIT+Sr5GfmDGE1MT4wnFiG2RaxarENsSOw1BU9Rp3Ao6yg5lHLKGcodyow4XlxDnCseJr5ZvFr8nPiI+KwETcJQwkMiUaJYolnipsQUFUvVoDpQI6j51CPUq9RxGkJTpXFpfNp2WgPtGm2CjqGz6Dx6HL2IfpI+QBdJUiWNJf0lsyWrJS9IChkIQ4PBYyQwShinGQ8Yn6QUpDhSkVK7pFqlhqTmpOWkbaUjpQul26SHpT/JMGUcZOJl9sp0yjyVRclqy3rJZskekr0mOyNHl7OU48sVyp2WeywPy2vLe8tvlD8i3y8/q6Co4KSQolCpcFVhRpGhaKsYp1imeFFxWommZK0Uq1SmdEnpJVOSyWEmMCuYvUyRsryys3KGcp3ygPK8CkvFTyVPpU3lqSpBla0apVqm2qMqUlNSc1fLVWtRe6yOV2erx6gfUO9Tn9NgaQRo7NTo1JhiSbN4rBxWC2tUk6xpo5mqWa95XwujxdaK1zqodVcb1jbRjtGu1r6jA+uY6sTqHNQZXIFeYb4iaUX9ihFdki5HN1O3RXdMj6Hnppen16n3Wl9NP1h/r36f/hcDE4MEgwaDJ4ZUQxfDPMNuw7+NtI34RtVG91eSVzqu3LKya+UbYx3jSONDxg9NaCbuJjtNekw+m5qZCkxbTafN1MxCzWrMRth0tie7mH3DHG1uZ77F/Lz5RwtTi3SL0xZ/Wepaxls2W06tYq2KXNWwatxKxSrMqs5KaM20DrU+bC20UbYJs6m3eW6rahth22g7ydHixHFOcF7bGdgJ7Nrt5rgW3E3cy/aIvZN9of2AA9XBz6HK4ZmjimO0Y4ujyMnEaaPTZWe0s6vzXucRngKPz2viiVzMXDa59LqSXH1cq1yfu2m7Cdy63WF3F/d97qOr1Vcnre70AB48j30eTz1Znqmev3phvDy9qr1eeBt653r3+dB81vs0+7z3tfMt8X3ip+mX4dfjT/EP8W/ynwuwDygNEAbqB24KvB0kGxQb1BWMDfYPbgyeXeOwZv+aiRCTkIKQB2tZa7PX3lwnuy5h3YX1lPVh68+EokMDQptDF8I8wurDZsN54TXhIj6Xf4D/KsI2oixiOtIqsjRyMsoqqjRqKtoqel/0dIxNTHnMTCw3tir2TZxzXG3cXLxH/LH4xYSAhLZEXGJo4rkkalJ8Um+yYnJ28mCKTkpBijDVInV/qkjgKmhMg9LWpnWl05c+xf4MzYwdGWOZ1pnVmR+y/LPOZEtkJ2X3b9DesGvDZI5jztGNqI38jT25yrnbcsc2cTbVbYY2h2/u2aK6JX/LxFanrce3EbbFb/stzyCvNO/d9oDt3fkK+Vvzx3c47WgpECsQFIzstNxZ+xPqp9ifBnat3FW560thROGtIoOi8qKFYn7xrZ8Nf674eXF31O6BEtOSQ3swe5L2PNhrs/d4qURpTun4Pvd9HWXMssKyd/vX779Zblxee4BwIOOAsMKtoqtSrXJP5UJVTNVwtV11W418za6auYMRB4cO2R5qrVWoLar9dDj28MM6p7qOeo368iOYI5lHXjT4N/QdZR9tapRtLGr8fCzpmPC49/HeJrOmpmb55pIWuCWjZfpEyIm7J+1PdrXqtta1MdqKToFTGade/hL6y4PTrqd7zrDPtJ5VP1vTTmsv7IA6NnSIOmM6hV1BXYPnXM71dFt2t/+q9+ux88rnqy9IXii5SLiYf3HxUs6l2cspl2euRF8Z71nf8+Rq4NX7vV69A9dcr9247nj9ah+n79INqxvnb1rcPHeLfavztuntjn6T/vbfTH5rHzAd6Lhjdqfrrvnd7sFVgxeHbIau3LO/d/0+7/7t4dXDgw/8HjwcCRkRPox4OPUo4dGbx5mP559sHUWPFj4Vf1r+TP5Z/e9av7cJTYUXxuzH+p/7PH8yzh9/9UfaHwsT+S/IL8onlSabpoymzk87Tt99ueblxKuUV/MzBX9K/FnzWvP12b9s/+oXBYom3gjeLP5d/Fbm7bF3xu96Zj1nn71PfD8/V/hB5sPxj+yPfZ8CPk3OZy1gFyo+a33u/uL6ZXQxcXHxPy6ikLxyKdSVAAAAIGNIUk0AAHomAACAhAAA+gAAAIDoAAB1MAAA6mAAADqYAAAXcJy6UTwAAAGnUExURf///93d3Xd3d7u7uwAAAO7u7oiIiMzMzDMzM0RERJmZmVVVVREREaqqqmZmZiIiIqOjoyAgIK+vrzIyMv93d/9ERP/dd//RRP8AAP/AACkpKf/3zP/gRP/umf/0u//VAP/pd/9TRP+hmf8VAP+Cd/+7d/+iRP/3mf/wRP/7zP+AAP/0d//qAP/5u/+qmf9jRP/UzP+Zd/9zRP/Dmf+SRP/hzP+xRP/Vmf/dmf/BRP/uzP+Nd/8qAP/Gu/9AAP+wd/9qAP/Xu/+VAP/Gd//Sd/+qAP/ou7syMrsAALsPALtfV7tnV7sfALuRibtUMrsvALuBV7tOALueibt3MrteALttALuRV7uaV7t9ALuqibuZMruNALuzibucALurV7uzV7usALu3iSsrKxMTE0QpKUQgICAPD0QjICAQD0QzMkQ1MkQmICkXE0Q9O0QwKUQpIBILCUQ4MiAaF0Q+O0Q5MkQvICkcE0Q/O0Q2KUQyIBINCUQ1ICAZD0Q9MkQ+MkQ4ICAaD0RBO0Q9KUQ7IBIQCURCO0Q+ICAdD0RBMkRCMkRBICAeD0RDO4UJMhgAAAABYktHRASPaNlRAAAACXBIWXMAAABQAAAAUAArV3ORAAAAB3RJTUUH4goeExk3UPM5GgAAC8dJREFUeNrtnYe76zYZh21L8opi33sP7WVTVtmr7D3K3nuUPcoqUFaBsgplF/5oJK98lr/TJLa/5NzT3/s89/pEsRTllSLrUxw7igAAAAAAAAAAAAAAAAAAAAAA4LoTJ0xiEp+7WtcBpZlErc5drWuASbVSro9mSZJFvg/7rdKpys5dszueLNdFUUaJ+1+nUabzTVFEhc6LZHnZT3WaASHTZRSlNkpy98BgQFiHRm3iRgW11VGpq60zC7Wr0Kq1hSeKyly7cQFqV6FRm2rTPzaVhtp1UF5jbCsTmTLy84LUq8VBbA0K7USqWmtdO61a29KPvbo4d72uD7Gf3UZGKbO4KADORNbMGRA5COCGBg/iXQDOwY2bBei5eWNNtbcuFOi5uLWmWoU5+o5i1UAbaglQKwbUigG1YkCtGIvVZn6e0UeSUEtYrLbQjk33AGoJawwIRm+7v6CWsIbatO7/glrCGmrrdj3UDw03o6fdFXDud3gwd98ec/fC8lZQm+nhvDXXa+96esC5jR3M7WeMub2wvBXUVrtRAGoJy9Ua/+1pB9QSlqst7e5vqCWsHo1BbQ/UDkCtGFArBtSKAbViQK0YUCsG1IoBtWJArRhQKwbUigG1YkCtGFArBtSKAbViQK0YUCsG1IoBtWJArRhQK8YVVGvUUAbUEparTXRR9z/dhlrCYrVKZ82ln9oHULtjsdqqUiVOAuVY/lsGm2/aE+7bU5ehdjCzWG3thtv+DFv0WmpmqdqN07rtzwOFWsJitaU10Yb8uAlqe5ZPvipb1+QneVDbs0bIsLtqDtQSEOgOQK0YUCsG1IoBtWJArRhQKwbUigG1YkCtGFArBtSKAbViQK0YUCsG1IoBtWJArRhQKwbUigG1YkCtGFArxpVT29zBp38AtYQVzq91DA+gdsdytcX4AdT2LFebJ7vb/UEtYQW1RV03bnHq8phVLsOOn4lwrDL5wvm1HMtveeHvDZx2D6CWsMItL+rdnYChlppZPCBk5H7AUEtAoDsAtWJArRhQKwbUigG1YkCtGFArBtSKAbViQK0YUCsG1IoBtWJArRhQKwbUigG1YkCtGFArBtSKcQXUxv7qlOQKVASoJcxQm/jTkEYnKA5ALWGu2pSozYYeDLWE49Xqlt094GOLs2c4ZvTasgyK2EAtx6wZQqyUGgaBtFBQyzFDran9gNDrjG3cqU2K4uKZ10zts54d8JzDy5tzGCsMLSCJrnGvfe7zAu45vLwZatOEPmoPat0DqCXMGhC25JEbdtO8LwRqCXMGBE3H2ijCgMCDNYQBqBXjCqidDAg7oJYws9fGVcklQy1h7oBgai4Vaglz1aZQu4+5Y63FgLCPOd8yuCgh5p+CWsKsAcGojH8Caglz1PoRoX5qfDd2YrWlVZGpNtxTUEuY+92Y0dxTUEuY02tz4yZfOfcU1BLmLCrmtsjtlnsKagmzZggqKfnZF9QSjlebNVaV4Z6DWsLRamPdqC0S7kmoJRytNmlnXQqHsX0cr7btrhkmX/s4Wm3ZdtctlsL3cfxYaxN3BFPD9adGQC3h+BnC1mqrNXsUg1rKnJBBJVusfO0H3+gOXD212e6iy1BLWaw2sUVd92Ev1BKWXyvc/bu+F1k991hr6mt7JdAzj7W5bmPfJzl1+fkBLzi3xjtDrVHl3uvXvjDgRefWyHH11Ea+v3Z/QC1hjasum3zfYQxq57DxV13u18Whdk21UXzAVZehdjFQS4DaAagVA2rFgFoxoFYMqBUDasWAWjGgVgyoFQNqxYBaMaBWDKgVA2rFgFoxoFYMqBUDasWAWjGgVgyoFePqqT3k1GWoncNGFzbH2TMSarcmMsM1fqB2TbWevacuQ+1MUtsMCE9y6vKJ1L743gAZtS95acDLpNSWu6vdn7nX3vvyABm1r3hlwKuE1BKzULuqWmoWaldVW/jLAF6Rs8KvmdoRUAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1UAu1i8h2Fw6H2jXVZvRuxVC7plqjEqiVUUsusQq1UmqPPHX51a8JeK1LfN19AZe96OvfEPDGy9S+KeDNLu0tbw1429VWGx3Xa9/+joB3usT73hVw2Yu++z0B771M7fsC7ndp7/9AwAehFmqhdg21au6py1B7nGeohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohVqohdpFPudddRlq92Hyup511WWo3Udam6jGiUkSaouEnFB360J96MMByvGRgI+6tI99POATLvGTnwpQl/DpzwR81iV+7vMBfs8vBHzRpX3pywFfcYkPfHXMAy7ta18P+IZL/Oa3Ar7NVvHi1gpqy9r/lRTFd75bfO/Blu932wcLxw86fthtf+TSfvxQy0+67UM/dYk/e7jl5932YZ/7Fx2/7La/cmm/fqTlN932kd+6xN892vL7bvuoz/2Hjj922z+5tD8/1vKXbvvYX13i439r+Xu3fdyl/eOfHf/qtv92if95ouW/3faJ/xUcN2+soDbNmXR2ZyYt4XbUp8ldrJ97RUYDwr7XhdpjKGtjhivc73tdqD0Gs7G7W14QMm5nLjHmdlSnyZ2tn3tV4nh5GQAAMFAuGreW5c7KJbnPhCmZ49uWOaiX9lA5y3JzFcrsoWq3zMtkiTkw96oom09q7SYUyeSwx7nJcqbSB+c2RcUcXLkKcWbLnLEd5/W0WRNdrboqeyBsd6j8RC0buyg1M3tziXai5+Dcsc71JnzTXIUyyzRMZes6bFiTV9N307Sq2Z7OaceG+exm2kzeTmlVztipS1UEfeKI3FUVV7oYq2Qq5GxXU7exjss6+HxwwXxkt965PvlgXbdaYvq+SxfBGJvm9O0Umave1E5SR6bOdZHNyq10HG1qW9MexVQoKX0fnbgtKteMuab9tOrahe4b62bf6uRui7YyKa2g8v0uJvfV62DsGLt1n8E4nZfbvbobPUxqRknTCjXWJm6VVe41FG2XpO21WU3309smHiyYsUKUUvteYuqRh7oJvLd2InJqp9KTGh+RmxmC2QrxbutJR4zbHyJvqnFG/xqxPfmhrNKpiYvxIkZm8yxKR3UxzbTU5OHRINaTGh+RO2KO50yFsuZTUYX7lpPWcw1TxSYZp5vcjbaqZteIBHGVTvyPzoM6ZrnWNXHjaqsTNne02UyTD8sdu4Egtfsr5A6UU4dNY9l0UuLWal0Ecxaz0doy636SdJUO+p1/y6PlnDixRakNn1tV83K7uYHr8KYw0yJHFSoLm0xmDV1jJeW0xCgbv1TTBkZ+/WsEW+m+goTUz66qatXcyo3RSb6/SFPXbnjQ427INhZb4mUfGFnYSrMVNHEziZyTOy7Z3FETrwVHKr5I391ow5iEaaysZEu87AMjTljpy95yw2hHH+RPcrvYfZI77Sa8Yw9uHPWz1rKmJZaMxbItiur24daksXwAF5bIt8HpmPSRsIJ9nG7oxKcP8mnuJnaf5O5yjXIPAT1pg6yP/GmRQ2KyO1AOgSwVNoTGtFXZNpBnWFQhlY6qPpqnFezjdEPGq+GdcG95lNt2cy2aewjolQ1ybw1TpE8s1S5NT5t6iKlJiXwbnIBhUaUkR5y6bjsjrSAbpw8HGpJ7iN1p7rj7anB0dB4C+iwd526ih7DIcUjhfHV3rSWNtVtgICWybXASmEUV/6VvYt2BJBtNAJk4nVt1GGL3UW7d9toN6chsQO9zu5h1/MUok7itgjsCe7gFBrYNTgOzqNLE/Kl1XXS0JxOnc0E+G7t3EWY2Gu24gL7JHTeLNXsSI8Zt+0ojt2wbnAZmUcWfAuLHwfBrlmmczgX5bOzORphcQN/lHp+DwiYe6vbS/eRhFlXce+HGJSZO51YduNidjTDZgJ7NzRfpn+AWZavpJ/88bplFFVc79kQTJvjmVh242D3iIkw2oGdzX1Lk4bBtIA6zqOIGRe5omnBNz63PZYdGPWxAz+Y+uMirhOKme5sTT1QAAOBO4/8e4yHspMjXAgAAACV0RVh0ZGF0ZTpjcmVhdGUAMjAxOC0xMC0zMFQxNDoyNTo1NS0wNTowMO1r0oQAAAAldEVYdGRhdGU6bW9kaWZ5ADIwMTgtMTAtMzBUMTQ6MjU6NTUtMDU6MDCcNmo4AAAAKHRFWHRwZGY6SGlSZXNCb3VuZGluZ0JveAAzMTEuODExeDI4My40NjUrMCswh03ywAAAABR0RVh0cGRmOlZlcnNpb24AUERGLTEuNCAcRzp4AAAAAElFTkSuQmCC" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%preview WGS_Call_SoS/test_samples_vcf/test_samples_plots/substitutions.0.pdf -s png --dpi 80" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Finally you will probably need to filter your data using commands such as:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Quality filtering\n", "[11]\n", "output: f'{_input:nn}.filtered.vcf.gz'\n", "run: container='gaow/debian-ngs:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " bcftools filter -O z -o {_output} \\\n", " -s LOWQUAL -i'{bcftools_filter}' {_input:nn}.vcf.gz \\\n", " && tabix -p vcf {_output}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Variant filtration is a subject worthy of an article in itself and the exact filters you will need to use will depend on the purpose of your study and quality and depth of the data used to call the variants." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "# Only extract regions we are interested in\n", "[12]\n", "stop_if(not keep_coord.is_file(), msg = 'Skip extracting selected SNPs because no list (--keep_coord coord.list) was provided to extract from.')\n", "output: f'{_input:nn}.extracted.vcf.gz'\n", "run: container='gaow/debian-ngs:1.0', volumes=[f'{wd:a}:{wd:a}'], expand=True, stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'\n", " tabix -H {_input} > {_output:n}\n", " tabix {_input} -R {keep_coord} >> {_output:n}\n", " bgzip {_output:n}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## References\n", "- The 1000 Genomes Project Consortium - An Integrated map of genetic variation from 1092 human genomes Nature 491, 56–65 (01 November 2012) doi:10.1038/nature11632\n", "- [GATK Best Practices](http://www.broadinstitute.org/gatk/guide/best-practices)" ] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "default_kernel": "SoS", "kernels": [ [ "SoS", "sos", "", "" ] ], "panel": { "displayed": true, "height": 0, "style": "side" }, "version": "0.20.5" } }, "nbformat": 4, "nbformat_minor": 4 }