% vtools show pipeline bwa_gatk28_hg19
A pipeline to align raw reads from fastq or BAW/SAM files using BWA and GATK
best practice. It uses hg19 of human reference genome and assumes paired-end
reads in plain text and compressed formats.
Available pipelines: align, call
Pipeline "align": Align raw reads from input files using bwa, gatk, and
picard. This pipeline accepts raw input files in plain text format, SAM/BAM
format, and their compressed versions (.zip, .tar.gz, .tgz, .bz2, .tbz2 etc).
All input files are assumed to be raw reads from the same sample. This
pipeline generates a calibrated bam file (--output), and its reduced version
if an additional output file is specified.
align_0: Check the version of variant tools (version 2.1.1 and
above is required to execute this pipeline)
align_1: Download required resources to resource directory
align_10: Check existence of commands bwa, samtools and java
align_11: Check the version of bwa. Version is 0.7.4 is
recommended
align_12: Check the version of picard. Version is 1.82 is
recommended.
align_13: Check the version of GATK. Version 2.8 is recommended.
align_20: Check existence of class files for Picard and GATK
align_30: Build bwa index for build hg19 of reference genome
align_40: Build samtools index for build hg19 of reference genome
align_100: Convert bam files to paired fastq files if the input is
in bam/sam format. Other input files are returned
untouched.
align_101: Decompress all input files (.tgz2, .tar, .tar.gz, .gz,
.tgz, .zip etc) to a cache directory. Uncompressed files
are hard-linked to the cache directory.
align_200: Check the format of the input fastq file and write an
option file with option -I if the sequences are in
Illumina 1.3+ format.
align_201: Call bwa aln to produce .sai files
align_300: Determine (guess) a read group tag and write to a .RG
file.
align_301: Running bwa sampe for paired end reads, using read group
tag saved in a .RG file
align_302: Check the proportion of aligned reads and exit if there
are less than 80% of aligned reads.
align_303: If in production mode, remove fastq files dumped from
bam files
align_400: Merge per-lane sam files into a single bam file. This
step is skipped if there is only one input file.
align_500: Sort merged bam file using picard SortSam
align_501: If in production mode, remove decompressed fastq and
individual bam files after a single bam file has been
produced.
align_600: Mark duplicates using picard MarkDuplicates command
align_601: Remove _sorted.bam file after deduplication is
completed.
align_610: Index dedupped bam file using samtools
align_700: Realign indels create indel realigner target
align_710: Apply indel realigner target to bam file
align_711: If in production mode, remove bam files before
realignment steps
align_800: Create base recalibrator target
align_810: Apply base recalibrator target
align_811: If in production mode, remove bam files before
realignment steps
align_900: Generated bam file with reduced reads if more than one
output file is specified
Pipeline "call": Call and validate variants (SNPs and Indels) from one or
more input bam files. This pipeline accepts one or more bam files as input and
a vcf file as output. If multiple input files are specified, multi-sample
variant calling will be performed.
call_0: Check the version of variant tools (version 2.1.1 and
above is required to execute this pipeline)
call_10: Check existence of commands bwa, samtools and java
call_13: Check the version of GATK
call_20: Check existence of class files for GATK
call_100: Use unified genotyper to get raw variants
call_200: Recalibrate SNPs
call_210: Recalibrate INDELs
call_300: Apply SNP recalibration
call_310: Apply INDEL recalibration
Pipeline parameters:
name Name of the job to be executed. All intermediate
files generated from this pipeline will be saved
to \\(CACHE_DIR/$NAME where \\(CACHE_DIR is the
cache directory of the project. (default:
bwa_gatk28_hg19)
strict_prog_version Whether or not use other version of programs if
the exact version does not exist. Setting it to
False allows variant tools to use other versions
of the programs. (default: False)
production If set to True or 1, all intermediate files will
be removed. The whole pipeline would need to be
rerun if a different parameter or different
version of external program is used. (default:
False)
bwa path to bwa. Default to 'bwa' (use \\(PATH to
determine actual path) (default: bwa)
samtools path to samtools. Default to 'samtools' (use
\\(PATH to determine actual path) (default:
samtools)
java path to java. Default to 'java' (use \\(PATH to
determine actual path) (default: java)
picard_path Path to picard jar files
gatk_path Path to GATK jar file GenomeAnalysisTK.jar
opt_java Option to java program. -Djava.io.tmpdir is
frequently used to set java temporary directory
if system temporary partition is not big enough.
(default: -Xmx4g -XX:-UseGCOverheadLimit)
opt_bwa_index Option to command 'bwa index'
opt_bwa_aln Option to command 'bwa aln'
opt_bwa_sampe Option to command 'bwa sampe'
opt_samtools_faidx Option to command 'samtools faidx'
opt_samtools_index Option to command 'samtools index'
opt_picard_sortsam Option to picard SortSam.jar (default:
VALIDATION_STRINGENCY=LENIENT)
opt_picard_mergesamfiles Option to picard MergeSamFiles.jar
(default: MAX_RECORDS_IN_RAM=5000000)
opt_picard_samtofastq Option to picard SamToFastq.jar (default:
VALIDATION_STRINGENCY=LENIENT NON_PF=true)
opt_picard_markduplicates Option to picard MarkDuplicates.jar
opt_gatk_realignertargetcreator Option to command gatk
RealignerTargetCreator
opt_gatk_indelrealigner Option to command gatk IndelRealigner
(default: --filter_mismatching_base_and_quals)
opt_gatk_baserecalibrator Option to command gatk BaseRecalibrator
(default: -rf BadCigar)
opt_gatk_printreads Option to command gatk PrintReads
opt_gatk_reducereads Option to command gatk ReduceReads
opt_gatk_unifiedgenotyper Option to command gatk UnifiedGenotyper
(default: -rf BadCigar -stand_call_conf 50.0
-stand_emit_conf 10.0 -dcov 200 -A
AlleleBalance -A QualByDepth -A HaplotypeScore
-A MappingQualityRankSumTest -A
ReadPosRankSumTest -A FisherStrand -A
RMSMappingQuality -A InbreedingCoeff -A
Coverage)
opt_gatk_variantrecalibrator Option to command gatk
VarintRecalibrator
opt_gatk_variantrecalibrator_snp Option to command gatk
VarintRecalibrator -mode SNP (default:
-percentBad 0.01 -minNumBad 1000 -an QD -an
MQRankSum -an ReadPosRankSum -an FS -an DP)
opt_gatk_variantrecalibrator_indel Option to command gatk
VarintRecalibrator -mode INDEL (default:
--maxGaussians 4 -percentBad 0.01 -minNumBad
1000 -an DP -an FS -an ReadPosRankSum -an
MQRankSum)
opt_gatk_applyrecalibration Option to command gatk ApplyRecalibrator
(default: --ts_filter_level 99.9)
opt_gatk_applyrecalibration_snp Option to command gatk
ApplyRecalibrator -mode SNP
opt_gatk_applyrecalibration_indel Option to command gatk
ApplyRecalibrator -mode INDEL
This pipeline uses the following external commands:
You can add path to bwa and samtools to $PATH
or pass full path name to options bwa=/path/to/bwa
and samtools=/path/to/samtools
. Paths to Picard
and GATK
should be passed using options gatk_path
and picard_path
.
After you installed the programs, you should running the following commands to test if everything works ok:
# create a new project and download test data (an online snapshot)
% vtools init test --parent vt_illuminaTestData
% vtools execute bwa_gatk28_hg19 align --input illumina_test_seq.tar.gz --output test.bam \
--gatk_path /path/to/gatk --picard_path /path/to/picard --strict_prog_version False
The pipeline will by default keep all intermediate files. If you restart the pipeline with different parameter or a different version of an external file, only the affected steps will be repeated. Intermediate files will be reused if available. This allows you to examine and fine-tune the pipeline to make sure it works as expected.
Because intermediate files can be large, an option --production true
is provided to execute the pipeline in production mode. In this mode, most intermediate files will be removed after the completion of the steps that make use of them. The pipeline can resume from the right step if it is interrupted, but has to be re-executed from the beginning if a result file is removed.