This tutorial explains the concepts of variant tools and demonstrates, through examples, how to use variant tools to import, select, and annotate genetic variants. You will need to have variant tools installed (Linux or Mac OSX) to follow this tutorial. Please also download sample data from here .
--help
, `vtools show
)Getting details of commands
\\( vtools -h
\\( vtools init -h
The variant tools website has detailed explanation and examples for all commands, utilities and pipelines.
Getting details of file formats, fields, pipelines, association tests, file tracks, runtime options, annotation databases, snapshots, pipelines and simulation models.
Getting a list of all supported file formats:
\\( vtools show formats -v0
CASAVA18_snps
CASAVA18_indels
plink
rsname
ANNOVAR_output
ANNOVAR
pileup_indel
ANNOVAR_exonic_variant_function
ANNOVAR_variant_function
twoalleles
map
polyphen2
basic
vcf
CGA
csv
tped
and details of a particular format:
\\( vtools show format basic
A basic variant import/export format that import variants with four tab-
delimited columns (chr, pos, ref, alt), and export variants, optional variant
info fields and genotypes.
Columns:
1 chromosome
2 variant position, set --pos_adj to -1 to export
variants in 0-based positions.
3 reference allele
4 alternative allele
5 Output variant info fields as one column
6 genotype in numeric style
Formatters are provided for fields: gt
variant:
chr Chromosome
pos 1-based position, set --pos_adj to 1 if input
position is 0 based.
ref Reference allele, '-' for insertion.
alt Alternative allele, '-' for deletion.
Format parameters:
chr_col Column index for the chromosome field (default: 1)
pos_col Column index for the position field (default: 2)
ref_col Column index for the reference field (default: 3)
alt_col Column index for the alternative field (default: 4)
pos_adj Set to 1 to import variants with 0-based positions,
or to -1 to export variants in 0-based positions.
(default: 0)
fields Fields to output, simple arithmetics are allowed
(e.g. pos+1) but aggregation functions are not
supported. (default: )
init
)Create an empty project
\\( vtools init tutorial
INFO: variant tools 2.4.0 : Copyright (c) 2011 - 2014 Bo Peng
INFO: San Lucas FA, Wang G, Scheet P, Peng B (2012) Bioinformatics 28(3):421-422
INFO: Please visit http://varianttools.sourceforge.net for more information.
INFO: Creating a new project tutorial
You cannot create a project in a folder with another project
\\( vtools init tutorial
ERROR: A project can only be created in a directory without another project.
but you can use option --force
to override this
\\( vtools init tutorial -f
INFO: variant tools 2.4.0 : Copyright (c) 2011 - 2014 Bo Peng
INFO: San Lucas FA, Wang G, Scheet P, Peng B (2012) Bioinformatics 28(3):421-422
INFO: Please visit http://varianttools.sourceforge.net for more information.
INFO: Creating a new project tutorial
`vtools import
)Let us have a look at the data
\\( gzcat CEU.vcf.gz | head -10
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 membership">
##INFO=<ID=HM3,Number=0,Type=Flag,Description="HapMap3 membership">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">
##reference=human_b36_both.fasta
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CB,Number=1,Type=String,Description="Called by S(Sanger), M(UMich), B(BI)">
##rsIDs=dbSNP b129 mapped to NCBI 36.3, August 10, 2009
bpeng1@BCBMC02MG1WJF6T:~/temp\\) gzcat CEU.vcf.gz | head -14
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 membership">
##INFO=<ID=HM3,Number=0,Type=Flag,Description="HapMap3 membership">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">
##reference=human_b36_both.fasta
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CB,Number=1,Type=String,Description="Called by S(Sanger), M(UMich), B(BI)">
##rsIDs=dbSNP b129 mapped to NCBI 36.3, August 10, 2009
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA06985 NA06986 NA06994 NA07000 NA07037 NA07051 NA07346 NA07347 NA07357 NA10847 NA10851 NA11829 NA11830 NA11831 NA11832 NA11840 NA11881 NA11894 NA11918 NA11919 NA11920 NA11931 NA11992 NA11993 NA11994 NA11995 NA12003 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12144 NA12154 NA12155 NA12156 NA12234 NA12249 NA12287 NA12414 NA12489 NA12716 NA12717 NA12749 NA12750 NA12751 NA12760 NA12761 NA12762 NA12763 NA12776 NA12812 NA12813 NA12814 NA12815 NA12828 NA12872 NA12873 NA12874
1 533 . G C . PASS AA=.;AC=6;AN=120;DP=423 GT:DP:CB 0|0:6:SMB 0|0:14:SMB 0|0:4:SMB 0|0:3:SMB 0|0:7:SMB 0|0:4:SMB 1|0:6:MB 0|0:3:SMB 0|0:13:SMB 0|0:1:SMB 0|0:14:SMB 0|0:10:SMB 0|0:6:SB 0|0:2:SMB 0|0:6:SMB 0|0:4:SMB 0|0:2:SMB 0|0:15:SMB 0|0:2:SMB 0|0:1:SMB 0|0:26:SMB 0|0:6:SMB 0|1:14:MB 0|0:5:SMB 0|0:3:SMB 0|0:20:SMB 0|0:3:SMB 0|0:2:SMB 0|0:4:SMB 0|0:12:SMB 0|0:1:SMB 0|0:7:SMB 0|0:2:SMB 0|0:25:SMB 0|0:9:SMB 0|1:1:MB 0|0:9:SMB 0|0:1:SMB 0|0:6:SMB 0|0:12:SMB 0|0:7:SMB 0|0:18:SMB 0|0:2:SMB 0|0:2:SM 0|0:38:SMB 0|0:3:SM 0|0:3:SMB 0|0:5:SMB 0|0:5:SMB 0|0:3:SMB 0|0:0:MB 0|0:5:SMB 0|0:7:SMB 0|0:0:SMB 0|0:6:SMB 1|0:5:SMB 0|0:4:MB 0|0:5:SMB 1|0:5:MB 0|1:9:SMB
If the compressed .vcf file is indexed (with .tbi
file), you can use vtools
to show its information
\\( vtools show track CEU.vcf.gz
Version VCF v4.0
Number of fields: 69
Header: (excluding INFO and FORMAT lines)
##reference=human_b36_both.fasta
##rsIDs=dbSNP b129 mapped to NCBI 36.3, August 10, 2009
Available fields (with type VARCHAR if unspecified or all=1):
0 (INTEGER) 1 if matched
chr (1, chrom) chromosome
pos (2, INTEGER) position (1-based)
name (3) name of variant
ref (4) reference allele
alt (5) alternative alleles
qual (6) qual
filter (7) filter
info (8, default) variant info fields
info.DP (INTEGER) Total Depth
info.HM2 (INTEGER, flag) HapMap2 membership
info.HM3 (INTEGER, flag) HapMap3 membership
info.AA Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README
info.AC (INTEGER) Allele count in genotypes
info.AN (INTEGER) Total number of alleles in called genotypes
format (9) genotype format
NA06985 (10) genotype for sample NA06985
NA06985.GT Genotype for sample NA06985
NA06985.DP (INTEGER) Read Depth for sample NA06985
NA06985.CB Called by S(Sanger), M(UMich), B(BI) for sample NA06985
...
Looking for
Import data with one variant information field AA
,
\\( vtools import CEU.vcf.gz --var_info AA --build hg18
INFO: Importing variants from CEU.vcf.gz (1/1)
CEU.vcf.gz: 100% [=================================================] 300 16.8K/s in 00:00:00
INFO: 288 new variants (288 SNVs) from 300 lines are imported.
Importing genotypes: 100% [======================================] 18,000 9.0K/s in 00:00:02
Copying samples: 100% [==============================================] 79 78.9/s in 00:00:01
Show the status of the project
\\( vtools show
Project name: tutorial
Created on: Fri Sep 19 06:09:40 2014
Primary reference genome: hg18
Secondary reference genome:
Runtime options: verbosity=1
Variant tables: variant
Annotation databases:
available variant tables
\\( vtools show tables
table #variants date message
variant 288 Sep19 Master variant table
details of a variant table
\\( vtools show table variant
Name: variant
Description: Master variant table
Creation date: Sep19
Command:
Fields: variant_id, bin, chr, pos, ref, alt, AA
Number of variants: 288
all genotypes
\\( vtools show genotypes -l 10
sample_name filename num_genotypes sample_genotype_fields
NA06985 CEU.vcf.gz 287 GT
NA06986 CEU.vcf.gz 287 GT
NA06994 CEU.vcf.gz 287 GT
NA07000 CEU.vcf.gz 287 GT
NA07037 CEU.vcf.gz 287 GT
NA07051 CEU.vcf.gz 287 GT
NA07346 CEU.vcf.gz 288 GT
NA07347 CEU.vcf.gz 287 GT
NA07357 CEU.vcf.gz 287 GT
NA10847 CEU.vcf.gz 287 GT
Information fields for each variant
\\( vtools show fields
variant.chr (char) Chromosome name (VARCHAR)
variant.pos (int) Position (INT, 1-based)
variant.ref (char) Reference allele (VARCHAR, - for missing allele of an insertion)
variant.alt (char) Alternative allele (VARCHAR, - for missing allele of an deletion)
variant.AA (char)
Output variants in a specified variant table with specified info fields
\\( vtools output variant chr pos ref alt AA -l 10
1 533 G C .
1 41342 T A .
1 41791 G A .
1 44449 T C C
1 44539 C T T
1 44571 G C g
1 45162 C T c
1 52066 T C C
1 53534 G A G
1 75891 T C .
`vtools update
)Command update adds more variant info fields to the project.
\\( vtools update variant --from_file CEU.vcf.gz --var_info DP
INFO: Using primary reference genome hg18 of the project.
Getting existing variants: 100% [================================================] 288 197.1K/s in 00:00:00
INFO: Updating variants from CEU.vcf.gz (1/1)
CEU.vcf.gz: 100% [=================================================================] 300 9.4K/s in 00:00:00
INFO: Field DP of 288 variants are updated
A new field DP
is added,
\\( vtools show fields
variant.chr (char) Chromosome name (VARCHAR)
variant.pos (int) Position (INT, 1-based)
variant.ref (char) Reference allele (VARCHAR, - for missing allele of an insertion)
variant.alt (char) Alternative allele (VARCHAR, - for missing allele of an deletion)
variant.AA (char)
variant.DP (int)
\\( vtools output variant chr pos ref alt AA DP -l 10
1 533 G C . 423
1 41342 T A . 188
1 41791 G A . 192
1 44449 T C C 166
1 44539 C T T 131
1 44571 G C g 135
1 45162 C T c 166
1 52066 T C C 159
1 53534 G A G 243
1 75891 T C . 182
Count the number of alternative alleles (not genotypes), homozygotes, heterozygotes, and calculate minior allele frequency for each variant
\\( vtools update variant --from_stat 'num=#(alt)' 'hom=#(hom)' 'het=#(het)' 'maf=maf()'
Counting variants: 100% [==================================] 60 1.1K/s in 00:00:00
INFO: Adding variant info field num with type INT
INFO: Adding variant info field hom with type INT
INFO: Adding variant info field het with type INT
INFO: Adding variant info field maf with type FLOAT
Updating variant: 100% [=================================] 288 65.0K/s in 00:00:00
INFO: 288 records are updated
\\( vtools output variant chr pos ref alt num hom het maf -l 10
1 533 G C 6 0 6 0.05
1 41342 T A 29 3 23 0.241666666667
1 41791 G A 5 0 5 0.0416666666667
1 44449 T C 2 0 2 0.0166666666667
1 44539 C T 2 0 2 0.0166666666667
1 44571 G C 7 0 7 0.0583333333333
1 45162 C T 20 4 12 0.166666666667
1 52066 T C 18 1 16 0.15
1 53534 G A 18 0 18 0.15
We can also calculate sample statistics for a subset of samples (e.g. in cases and controls)
\\( vtools show samples -l 10
sample_name filename
NA06985 CEU.vcf.gz
NA06986 CEU.vcf.gz
NA06994 CEU.vcf.gz
NA07000 CEU.vcf.gz
NA07037 CEU.vcf.gz
NA07051 CEU.vcf.gz
NA07346 CEU.vcf.gz
NA07347 CEU.vcf.gz
NA07357 CEU.vcf.gz
NA10847 CEU.vcf.gz
(50 records omitted)
\\( vtools update variant --from_stat 'num12=#(alt)' 'hom12=#(hom)' 'het12=#(het)' \
'maf12=maf()' --samples "sample_name like 'NA12%'"
INFO: 34 samples are selected
Counting variants: 100% [==================================] 34 1.9K/s in 00:00:00
INFO: Adding variant info field num12 with type INT
INFO: Adding variant info field hom12 with type INT
INFO: Adding variant info field het12 with type INT
INFO: Adding variant info field maf12 with type FLOAT
Updating variant: 100% [=================================] 288 45.9K/s in 00:00:00
INFO: 288 records are updated
\\( vtools output variant chr pos num maf num12 maf12 -l 10
1 533 6 0.05 4 0.0588235294118
1 41342 29 0.241666666667 16 0.235294117647
1 41791 5 0.0416666666667 2 0.0294117647059
1 44449 2 0.0166666666667 2 0.0294117647059
1 44539 2 0.0166666666667 2 0.0294117647059
1 44571 7 0.0583333333333 7 0.102941176471
1 45162 20 0.166666666667 11 0.161764705882
1 52066 18 0.15 14 0.205882352941
1 53534 18 0.15 8 0.117647058824
1 75891 11 0.0916666666667 9 0.132352941176
\\( vtools show fields
variant.chr (char) Chromosome name (VARCHAR)
variant.pos (int) Position (INT, 1-based)
variant.ref (char) Reference allele (VARCHAR, - for missing allele of an
insertion)
variant.alt (char) Alternative allele (VARCHAR, - for missing allele of an
deletion)
variant.AA (char)
variant.DP (int)
variant.num (int) Created from stat "#(alt)" with type INT on Sep19
variant.hom (int) Created from stat "#(hom)" with type INT on Sep19
variant.het (int) Created from stat "#(het)" with type INT on Sep19
variant.maf (float) Created from stat "maf()" with type FLOAT on Sep19
variant.num12 (int) Created from stat "#(alt)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.hom12 (int) Created from stat "#(hom)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.het12 (int) Created from stat "#(het)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.maf12 (float) Created from stat "maf()" for samples ["sample_name like
'NA12%'"]with type FLOAT on Sep19
`vtools phenotype
)\\( vtools show phenotypes -l 10
sample_name
NA06985
NA06986
NA06994
NA07000
NA07037
NA07051
NA07346
NA07347
NA07357
NA10847
(50 records omitted)
\\( head -5 phenotype.txt
sample_name aff sex BMI
NA06985 2 F 19.64
NA06986 1 M None
NA06994 1 F 19.49
NA07000 2 F 21.52
\\( vtools phenotype --from_file phenotype.txt
INFO: Adding phenotype aff of type INT
INFO: Adding phenotype sex of type VARCHAR(1)
INFO: Adding phenotype BMI of type FLOAT
WARNING: Value "None" is treated as missing in phenotype BMI
WARNING: 1 missing values are identified for phenotype BMI
\\( vtools show phenotypes -l 10
sample_name aff sex BMI
NA06985 2 F 19.64
NA06986 1 M .
NA06994 1 F 19.49
NA07000 2 F 21.52
NA07037 2 F 23.05
NA07051 1 F 21.01
NA07346 1 F 18.93
NA07347 2 M 19.2
NA07357 2 M 20.61
NA10847 2 M 14.6
(50 records omitted)
Phenotypes can be used to select samples.
\\( vtools show samples -l 10
sample_name filename aff sex BMI
NA06985 CEU.vcf.gz 2 F 19.64
NA06986 CEU.vcf.gz 1 M .
NA06994 CEU.vcf.gz 1 F 19.49
NA07000 CEU.vcf.gz 2 F 21.52
NA07037 CEU.vcf.gz 2 F 23.05
NA07051 CEU.vcf.gz 1 F 21.01
NA07346 CEU.vcf.gz 1 F 18.93
NA07347 CEU.vcf.gz 2 M 19.2
NA07357 CEU.vcf.gz 2 M 20.61
NA10847 CEU.vcf.gz 2 M 14.6
(50 records omitted)
\\( vtools update variant --from_stat 'nCase=#(alt)' --samples 'aff=2'
INFO: 35 samples are selected
Counting variants: 100% [=============================] 35 1.4K/s in 00:00:00
INFO: Adding variant info field nCase with type INT
Updating variant: 100% [============================] 288 72.3K/s in 00:00:00
INFO: 288 records are updated
\\( vtools update variant --from_stat 'nCtrl=#(alt)' --samples 'aff=1'
INFO: 25 samples are selected
Counting variants: 100% [=============================] 25 2.0K/s in 00:00:00
INFO: Adding variant info field nCtrl with type INT
Updating variant: 100% [============================] 288 78.5K/s in 00:00:00
INFO: 288 records are updated
`vtools select
)The master variant table variant
contains all variants in the project, we can select subsets of variants from this table and create multiple variant tables.
\\( vtools show fields
variant.chr (char) Chromosome name (VARCHAR)
variant.pos (int) Position (INT, 1-based)
variant.ref (char) Reference allele (VARCHAR, - for missing allele of an
insertion)
variant.alt (char) Alternative allele (VARCHAR, - for missing allele of an
deletion)
variant.AA (char)
variant.DP (int)
variant.num (int) Created from stat "#(alt)" with type INT on Sep19
variant.hom (int) Created from stat "#(hom)" with type INT on Sep19
variant.het (int) Created from stat "#(het)" with type INT on Sep19
variant.maf (float) Created from stat "maf()" with type FLOAT on Sep19
variant.num12 (int) Created from stat "#(alt)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.hom12 (int) Created from stat "#(hom)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.het12 (int) Created from stat "#(het)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.maf12 (float) Created from stat "maf()" for samples ["sample_name like
'NA12%'"]with type FLOAT on Sep19
variant.nCase (int) Created from stat "#(alt)" for samples ['aff=2']with type INT
on Sep19
variant.nCtrl (int) Created from stat "#(alt)" for samples ['aff=1']with type INT
on Sep19
Select variant and output to another table (option --to_table
)
\\( vtools select variant 'maf > 0.1' -t common 'Variants with MAF > 0.1'
Running: 0 0.0/s in 00:00:00
INFO: 89 variants selected.
\\( vtools show tables
table #variants date message
common 89 Sep19 Variants with MAF > 0.1
variant 288 Sep19 Master variant table
\\( vtools output common chr pos ref alt AA maf -l 10
1 41342 T A . 0.241666666667
1 45162 C T c 0.166666666667
1 52066 T C C 0.15
1 53534 G A G 0.15
1 98173 T C . 0.483333333333
1 225792 G A . 0.116666666667
1 742429 G A g 0.141666666667
1 742584 A G a 0.141666666667
1 743268 C A a 0.116666666667
1 743288 T C t 0.308333333333
Count the number of variants
\\( vtools select common 'hom > 2' --count
Counting variants: 0 0.0/s in 00:00:00
45
or output them,
\\( vtools select common 'hom > 2' -o chr pos hom het maf -l 10
1 41342 3 23 0.241666666667
1 45162 4 12 0.166666666667
1 98173 10 38 0.483333333333
1 742429 43 17 0.141666666667
1 742584 43 17 0.141666666667
1 743268 46 14 0.116666666667
1 743288 29 25 0.308333333333
1 743712 5 25 0.291666666667
1 744197 44 16 0.133333333333
1 744366 43 17 0.141666666667
You can also select variants from specified samples
\\( vtools select common --samples "sample_name = 'NA12776'" -t common_in_12776
INFO: 1 samples are selected by condition: sample_name = 'NA12776'
Running: 0 0.0/s in 00:00:00
INFO: 88 variants selected.
`vtools use
)\\( vtools show annotations -v0 -l 10
CancerGeneCensus-20130711
CancerGeneCensus
CosmicCodingMuts-v67_20131024
CosmicCodingMuts
CosmicMutantExport-v67_241013
CosmicMutantExport
CosmicNonCodingVariants-v67_241013
CosmicNonCodingVariants
DGV-hg18_20130723
DGV-hg19_20130723
(77 records omitted)
\\( vtools use refGene
INFO: Downloading annotation database from annoDB/refGene.ann
ERROR: Annotation database cannot be used because it is based on a reference genome that
is different from the one used by the project. Please use a version of annotation databse
for the project (vtools show annotations), or liftover the existing project (vtoos liftover)
to make it compatible with the annotation database.
\\( vtools liftover hg19
INFO: Downloading liftOver chain file from UCSC
INFO: Exporting variants in BED format
Exporting variants: 100% [==================================] 288 137.0K/s in 00:00:00
INFO: Running UCSC liftOver tool
Updating table variant: 100% [================================] 288 1.6K/s in 00:00:00
Coordinates in hg18
\\( vtools output variant chr pos ref alt -l 10
1 533 G C
1 41342 T A
1 41791 G A
1 44449 T C
1 44539 C T
1 44571 G C
1 45162 C T
1 52066 T C
1 53534 G A
1 75891 T C
Coordinates in hg19
\\( vtools output variant chr pos ref alt -l 10 --build hg19
1 10533 G C
1 51479 T A
1 51928 G A
1 54586 T C
1 54676 C T
1 54708 G C
1 55299 C T
1 62203 T C
1 63671 G A
1 86028 T C
\\( vtools use refGene
INFO: Downloading annotation database from annoDB/refGene.ann
INFO: Downloading annotation database from http://vtools.houstonbioinformatics.org/annoDB/refGene-hg19_20130904.DB.gz
INFO: Using annotation DB refGene as refGene in project tutorial.
INFO: Known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq).
\\( vtools show annotation refGene
Annotation database refGene (version hg19_20130904)
Description: Known human protein-coding and non-protein-coding genes taken
from the NCBI RNA reference sequences collection (RefSeq).
Database type: range
Reference genome hg19: chr, txStart, txEnd
name (char) Gene name
chr (char)
strand (char) which DNA strand contains the observed alleles
txStart (int) Transcription start position (1-based)
txEnd (int) Transcription end position
cdsStart (int) Coding region start (1-based)
cdsEnd (int) Coding region end
exonCount (int) Number of exons
exonStarts (char) Starting point of exons (adjusted to 1-based positions)
exonEnds (char) Ending point of exons
score (int) Score
name2 (char) Alternative name
cdsStartStat (char) cds start stat, can be 'non', 'unk', 'incompl', and 'cmp1'
cdsEndStat (char) cds end stat, can be 'non', 'unk', 'incompl', and 'cmp1'
\\( vtools show fields
variant.chr (char) Chromosome name (VARCHAR)
variant.pos (int) Position (INT, 1-based)
variant.ref (char) Reference allele (VARCHAR, - for missing allele of an
insertion)
variant.alt (char) Alternative allele (VARCHAR, - for missing allele of an
deletion)
variant.AA (char)
variant.DP (int)
variant.num (int) Created from stat "#(alt)" with type INT on Sep19
variant.hom (int) Created from stat "#(hom)" with type INT on Sep19
variant.het (int) Created from stat "#(het)" with type INT on Sep19
variant.maf (float) Created from stat "maf()" with type FLOAT on Sep19
variant.num12 (int) Created from stat "#(alt)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.hom12 (int) Created from stat "#(hom)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.het12 (int) Created from stat "#(het)" for samples ["sample_name like
'NA12%'"]with type INT on Sep19
variant.maf12 (float) Created from stat "maf()" for samples ["sample_name like
'NA12%'"]with type FLOAT on Sep19
variant.nCase (int) Created from stat "#(alt)" for samples ['aff=2']with type INT
on Sep19
variant.nCtrl (int) Created from stat "#(alt)" for samples ['aff=1']with type INT
on Sep19
variant.alt_chr (char)
variant.alt_pos (int)
refGene.name (char) Gene name
refGene.chr (char)
refGene.strand (char) which DNA strand contains the observed alleles
refGene.txStart (int) Transcription start position (1-based)
refGene.txEnd (int) Transcription end position
refGene.cdsStart (int) Coding region start (1-based)
refGene.cdsEnd (int) Coding region end
refGene.exonCount (int) Number of exons
refGene.exonStarts (char)
Starting point of exons (adjusted to 1-based positions)
refGene.exonEnds (char) Ending point of exons
refGene.score (int) Score
refGene.name2 (char) Alternative name
refGene.cdsStartStat (char)
cds start stat, can be 'non', 'unk', 'incompl', and 'cmp1'
refGene.cdsEndStat (char)
cds end stat, can be 'non', 'unk', 'incompl', and 'cmp1'
\\( vtools output variant chr pos refGene.name refGene.name2 -l10
1 533 . .
1 41342 . .
1 41791 . .
1 44449 . .
1 44539 . .
1 44571 . .
1 45162 . .
1 52066 . .
1 53534 . .
1 75891 . .
\\( vtools select variant 'refGene.chr IS NOT NULL' -t in_gene
Running: 4 902.7/s in 00:00:00
INFO: 121 variants selected.
\\( vtools output in_gene chr pos refGene.name refGene.name2 -l10
1 695745 NR_033908 LOC100288069
1 697749 NR_033908 LOC100288069
1 703777 NR_033908 LOC100288069
1 703882 NR_033908 LOC100288069
1 743268 NR_103536 FAM87B
1 743288 NR_103536 FAM87B
1 743404 NR_103536 FAM87B
1 743712 NR_103536 FAM87B
1 744074 NR_103536 FAM87B
1 744197 NR_103536 FAM87B
\\( vtools use dbSNP
INFO: Downloading annotation database from annoDB/dbSNP.ann
INFO: Downloading annotation database from http://vtools.houstonbioinformatics.org/annoDB/dbSNP-hg19_138.DB.gz
INFO: Using annotation DB dbSNP as dbSNP in project tutorial.
INFO: dbSNP version 138, created using vcf file downloaded from NCBI
\\( vtools select common 'dbSNP.chr IS NOT NULL' -t in_dbSNP
Running: 0 0.0/s in 00:00:00
INFO: 83 variants selected.
\\( vtools output in_dbSNP chr pos ref alt dbSNP.name -l 10
1 41342 T A rs116400033
1 45162 C T rs10399749
1 52066 T C rs28402963
1 53534 G A rs116440577
1 98173 T C rs74747225
1 225792 G A rs200488504
1 742429 G A rs3094315
1 742584 A G rs3131972
1 743268 C A rs61770173
1 743288 T C rs3131970
Select damaging variants
\\( vtools use dbNSFP
INFO: Downloading annotation database from annoDB/dbNSFP.ann
INFO: Downloading annotation database from http://vtools.houstonbioinformatics.org/annoDB/dbNSFP-hg18_hg19_2_4.DB.gz
INFO: Decompressing /Users/bpeng1/.variant_tools/annoDB/dbNSFP-hg18_hg19_2_4.DB.gz
INFO: Using annotation DB dbNSFP as dbNSFP in project tutorial.
INFO: dbNSFP version 2.4, maintained by Xiaoming Liu from UTSPH. Please cite
"Liu X, Jian X, and Boerwinkle E. 2011. dbNSFP: a lightweight database of human
non-synonymous SNPs and their functional predictions. Human Mutation. 32:894-899" and
"Liu X, Jian X, and Boerwinkle E. 2013. dbNSFP v2.0: A Database of Human Nonsynonymous
SNVs and Their Functional Predictions and Annotations. Human Mutation. 34:E2393-E2402."
if you find this database useful.
Unfortunately, this dataset does not have any nonsynonymous variants recorded in dbNSFP
\\( vtools select variant 'dbNSFP.chr is not NULL' -c
Counting variants: 0 0.0/s in 00:00:00
0
ref_genome
\\( vtools output common chr pos ref 'ref_sequence(chr, pos)' -l 10
1 41342 T T
1 45162 C C
1 52066 T T
1 53534 G G
1 98173 T T
1 225792 G G
1 742429 G G
1 742584 A A
1 743268 C C
1 743288 T T
\\( vtools output common chr pos ref 'ref_sequence(chr, pos-5, pos+5)' -l 10
1 41342 T TGTATTTTATG
1 45162 C CTATGCGACCT
1 52066 T TATTATATTTT
1 53534 G TCGTCGGCTCA
1 98173 T TTCCATAAGAA
1 225792 G TTTTCGTGTGC
1 742429 G GAAACGTTTGA
1 742584 A GGAAAAGGGAA
1 743268 C GATGGCGGGAA
1 743288 T CTCTGTGGGCC
genotype
and samples
You can check the genotype of variants in a particular sample
\\( vtools output common chr pos ref alt "genotype('NA07037')" -l 10
1 41342 T A 0
1 45162 C T 2
1 52066 T C 0
1 53534 G A 1
1 98173 T C 1
1 225792 G A 0
1 742429 G A 2
1 742584 A G 2
1 743268 C A 2
1 743288 T C 2
or in all samples
\\( vtools output common chr pos "genotype()" -l 2
1 41342 0,1,1,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,1,0,1,0,1,1,2,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,2,1,1,1,0,0,0,1,1,2,0,1,0,0,1,1,0,0,0,0
1 45162 0,0,0,1,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,2,0,1,0,0,0,0,0,1,0,1,0,1,2,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,2,0,0,0
Function samples
outputs name of samples that harbor the variants
\\( vtools output common "samples('geno_filter=GT!=0')" -l 2
NA06986,NA06994,NA07051,NA07347,NA07357,NA11830,NA11840,NA11918,NA11920,NA11992,NA11993,NA11994,NA12005,NA12044,NA12154,NA12234,NA12414,NA12489,NA12716,NA12717,NA12760,NA12761,NA12762,NA12776,NA12814,NA12815
NA07000,NA07037,NA07051,NA11829,NA11840,NA11894,NA11995,NA12004,NA12144,NA12155,NA12234,NA12249,NA12761,NA12763,NA12814,NA12828
track
A track is an external file that contains information about variants, which can be in format vcf
, bam
and BigWig
and BigBed
.
\\( vtools output common chr pos ref alt "track('CEU.vcf.gz')" -l 10
1 41342 T A AA=.;AC=29;AN=120;DP=188
1 45162 C T AA=c;AC=20;AN=120;DP=166;HM2
1 52066 T C AA=C;AC=18;AN=120;DP=159
1 53534 G A AA=G;AC=18;AN=120;DP=243
1 98173 T C AA=.;AC=58;AN=120;DP=184
1 225792 G A AA=.;AC=14;AN=120;DP=464
1 742429 G A AA=g;AC=103;AN=120;DP=314;HM2
1 742584 A G AA=a;AC=103;AN=120;DP=366;HM3
1 743268 C A AA=a;AC=106;AN=120;DP=64
1 743288 T C AA=t;AC=83;AN=120;DP=69
The track
function is particularly useful for checking reads that cover a variant in BAM files.
ANNOVA
\\( vtools show pipeline ANNOVAR
Pipeline to call ANNOVAR and import results as variant info fields.
Available pipelines: geneanno
Pipeline "geneanno": This pipeline exports variants in specified variant table (parameter
--var_table, default to variant), executes ANNOVAR's gene-based annotation
(annotate_variantion.pl --geneanno), and imports specified fields from output of the command.
Four fields (two for all variants and two for exonic variants) will be imported unless you
disable some of them using parameters --variant_info and --exonic_info. No input or output
file is required for this pipeline, but a snapshot could be specified, in which case the
snapshot will be loaded (and overwrite the present project).
geneanno_0: Load specified snapshot if a snapshot is specified. Otherwise use the
existing project.
geneanno_10: Check the existence of ANNOVAR's annotate_variation.pl command.
geneanno_11: Determine the humandb path of ANNOVAR
geneanno_14: Download gene database for specified --dbtype if they are unavailable
geneanno_20: Export variants in ANNOVAR format
geneanno_30: Execute ANNOVAR annotate_variation.pl --geneanno
geneanno_40: Importing results from ANNOVAR output .variant_function if
--variant_info is specified
geneanno_50: Importing results from ANNOVAR output .exonic_variant_function if
--exonic_info is specified
Pipeline parameters:
var_table Variant table for the variants to be analyzed. (default: variant)
annovar_path Path to a directory that contains annotate_variation.pl, if the script
is not in the default \\(PATH.
dbtype --dbtype parameter that will be passed to annotate_variation.pl
--dbtype. The default value if refGene, but you can also use knownGene,
ensGene. (default: refGene)
variant_info Fields to import from the first two columns of .variant_function output
of ANNOVAR. (default: region_type, region_name)
exonic_info Fields to import from the .exonic_variant_function output of ANNOVAR.
It has to be zero, one or more of mut_type and function. (default:
mut_type, function)
Using ANNOVAR to annotate variants
\\( vtools execute ANNOVAR --annovar_path ~/bin/annovar/
INFO: Executing ANNOVAR.geneanno_0: Load specified snapshot if a snapshot is specified. Otherwise use the existing project.
INFO: Executing ANNOVAR.geneanno_10: Check the existence of ANNOVAR's annotate_variation.pl command.
INFO: Command /Users/bpeng1/bin/annovar//annotate_variation.pl is located.
INFO: Executing ANNOVAR.geneanno_11: Determine the humandb path of ANNOVAR
INFO: Running which /Users/bpeng1/bin/annovar//annotate_variation.pl > cache/annovar.path
INFO: Executing ANNOVAR.geneanno_14: Download gene database for specified --dbtype if they are unavailable
INFO: Running /Users/bpeng1/bin/annovar//annotate_variation.pl --buildver hg18 -downdb refGene /Users/bpeng1/bin/annovar//humandb
INFO: Executing ANNOVAR.geneanno_20: Export variants in ANNOVAR format
INFO: Running vtools export variant --format ANNOVAR --output cache/annovar_input
INFO: Executing ANNOVAR.geneanno_30: Execute ANNOVAR annotate_variation.pl --geneanno
INFO: Running /Users/bpeng1/bin/annovar//annotate_variation.pl --geneanno --dbtype refGene --buildver hg18 cache/annovar_input /Users/bpeng1/bin/annovar//humandb
INFO: Executing ANNOVAR.geneanno_40: Importing results from ANNOVAR output .variant_function if --variant_info is specified
INFO: Using primary reference genome hg18 of the project.
Getting existing variants: 100% [==================================] 288 206.3K/s in 00:00:00
INFO: Updating variants from cache/annovar_input.variant_function (1/1)
annovar_input.variant_function: 100% [==============================] 288 20.8K/s in 00:00:00
INFO: Fields region_type, region_name of 288 variants are updated
INFO: Running vtools update variant --from_file cache/annovar_input.variant_function --format ANNOVAR_variant_function --var_info region_type, region_name
INFO: Executing ANNOVAR.geneanno_50: Importing results from ANNOVAR output .exonic_variant_function if --exonic_info is specified
INFO: Using primary reference genome hg18 of the project.
Getting existing variants: 100% [==================================] 288 160.6K/s in 00:00:00
INFO: Updating variants from cache/annovar_input.exonic_variant_function (1/1)
annovar_input.exonic_variant_function: 0 0.0/s in 00:00:00
INFO: Fields mut_type, function of 0 variants are updated
INFO: Running vtools update variant --from_file cache/annovar_input.exonic_variant_function --format ANNOVAR_exonic_variant_function --var_info mut_type, function
Save a snapshot of the project
\\( vtools admin --save_snapshot ACM-BCB2014-tutorial.tgz 'Tutorial section for AVM BCB meeting'
ACM-BCB2014-tutorial.tgz: 100% [================================] 358,722 13.6M/s in 00:00:00
INFO: Snapshot ACM-BCB2014-tutorial.tgz has been saved
You can load it using command
\\( vtools admin --load_snapshot ACM-BCB2014-tutorial.tgz
Extracting ACM-BCB2014-tutorial.tgz: 0.0% [> ] in 00:00:00
Extracting ACM-BCB2014-tutorial.tgz: 100% [=======================] 74,928 4.5M/s in 00:00:00
INFO: Snapshot ACM-BCB2014-tutorial.tgz has been loaded
It is strongly recommended that you save copies of your project (snapshots) during the analysis of real-world projects.