Variants in a variant tools project are stored in a master variant table after they are imported from external data files. Multiple variant info fields can be added to this table to describe these variants. These fields can be imported from a file when variants are imported, updated from files after variants are imported, calculated from samples as sample statistics, or derived from other variant info or annotation fields. Variant info fields are part of a project and are usually project-specific.
Annotation fields are provided in annotation databases and are available to a project after the databases are linked to the project. Conceptually, annotation databases add columns to the master variant table so that you can select variants based on these fields, or output variants with these annotation fields. A key difference between a variant info field and an annotation field is that a variant info field has a single value for each variant, whereas there can be multiple annotation values for a single variant. Note that annotation databases vary greatly in number of fields and coverage of variants, and usually do not provide annotation for all variants.
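The difference between the two kinds of fields can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module with made-up table names, field names (depth, gene), and data; it does not reflect the actual schema of a variant tools project or of any annotation database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical master variant table with one variant info field (depth).
# Each variant has exactly one value for this field.
cur.execute(
    "CREATE TABLE variant (variant_id INTEGER PRIMARY KEY, "
    "chr TEXT, pos INTEGER, ref TEXT, alt TEXT, depth INTEGER)"
)
cur.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "1", 1000, "A", "G", 35),
        (2, "1", 2000, "C", "T", 12),
        (3, "2", 500, "G", "A", 48),
    ],
)

# Hypothetical annotation table: variant 1 has two annotation values,
# variant 2 has one, and variant 3 is not covered at all.
cur.execute("CREATE TABLE gene_annotation (variant_id INTEGER, gene TEXT)")
cur.executemany(
    "INSERT INTO gene_annotation VALUES (?, ?)",
    [(1, "GENE_A"), (1, "GENE_B"), (2, "GENE_C")],
)

# Linking the annotation conceptually adds a column to the variant table.
# A LEFT JOIN keeps variants that have no annotation.
rows = cur.execute(
    "SELECT v.variant_id, v.depth, a.gene "
    "FROM variant v LEFT JOIN gene_annotation a "
    "ON v.variant_id = a.variant_id "
    "ORDER BY v.variant_id, a.gene"
).fetchall()

for row in rows:
    print(row)
```

In this toy data, variant 1 appears twice in the joined output (once per annotation value) while its depth stays the same, and variant 3 appears with an empty annotation because the annotation table does not cover it.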
The annotation features of variant tools mostly involve the commands vtools use, vtools output, and vtools export.
Command vtools use creates, downloads, and links annotation databases to a project. It accepts the name of an annotation database and tries to locate it locally (current directory, ~/.variant_tools) and then online (usually http://vtools.houstonbioinformatics.org/annoDB). If no existing database can be found, it builds the annotation database from source files. When a database is linked to a project, all its annotation fields become available to the project. Command vtools show fields lists all variant info and annotation fields of a project.
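The lookup order described for vtools use (local directories first, then the online repository, then a build from source) can be sketched as follows. This is only an illustration of the fallback logic, not the tool's implementation: the function name, the ".DB" file extension, and the set that stands in for the online listing are all assumptions made for this example, and no network access is performed.

```python
import os
import tempfile


def locate_annotation_db(name, local_dirs, online_names):
    """Decide where a database would come from: local, online, or build.

    name         -- database name, e.g. "dbNSFP"
    local_dirs   -- directories to search, in order
    online_names -- hypothetical set of names available online
    """
    filename = name + ".DB"  # assumed file naming, for illustration only
    for directory in local_dirs:
        candidate = os.path.join(os.path.expanduser(directory), filename)
        if os.path.isfile(candidate):
            return ("local", candidate)
    if name in online_names:
        return ("online", name)
    # Nothing found: fall back to building from source files.
    return ("build", name)


# Usage with a temporary directory standing in for ~/.variant_tools.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "dbNSFP.DB"), "w").close()
    available_online = {"gwasCatalog"}
    local = locate_annotation_db("dbNSFP", [tmp], available_online)
    online = locate_annotation_db("gwasCatalog", [tmp], available_online)
    build = locate_annotation_db("myAnnotation", [tmp], available_online)

print(local[0], online[0], build[0])
```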
Command vtools output outputs variants in a variant table along with their info fields and annotation fields. Conceptually speaking, the master variant table and all the variant info and annotation fields form a huge table with variants as rows and fields as columns. This command outputs subsets of variants and fields to the standard output. (As an advanced feature, this command can also output summary statistics of variants and fields.)
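The "huge table" view can be made concrete with a short sketch: rows are variants, columns are fields, and an output operation picks a subset of both. The field names, the data, and the helper function below are hypothetical and only mimic the idea; they are not the interface of vtools output.

```python
# Hypothetical combined table: one dict per variant, one key per field.
table = [
    {"chr": "1", "pos": 1000, "ref": "A", "alt": "G", "depth": 35, "gene": "GENE_A"},
    {"chr": "1", "pos": 2000, "ref": "C", "alt": "T", "depth": 12, "gene": "GENE_B"},
    {"chr": "2", "pos": 500, "ref": "G", "alt": "A", "depth": 48, "gene": "GENE_C"},
]


def select_fields(rows, fields, condition=lambda row: True):
    """Return the chosen fields for every row that satisfies condition."""
    return [tuple(row[f] for f in fields) for row in rows if condition(row)]


# A subset of variants (depth >= 30) and a subset of fields.
subset = select_fields(table, ["chr", "pos", "gene"], lambda row: row["depth"] >= 30)

# A summary statistic over a field instead of per-variant output.
mean_depth = sum(row["depth"] for row in table) / len(table)

for line in subset:
    print("\t".join(str(value) for value in line))
print("mean depth:", round(mean_depth, 2))
```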
Command vtools export exports variants in specific formats. This command is similar to vtools output, but it exports variants and related fields in user-specified formats.
We will demonstrate the use of these commands in the Tutorial.
Variant tools supports four different types of annotation databases, each of which links to variants using a different method. An annotation database can support one or more reference genomes, and it must match either the primary or the alternative reference genome of a project to be linked to the project, unless it is a field database that annotates another field such as gene name.
Variant-based annotation databases annotate specific variants. They contain annotation information for variants identified by (chr, pos, ref, alt). For example, the dbNSFP database lists, among about 20 annotation fields, reference and mutated amino acids, the nonsynonymous-to-synonymous rate ratio, SIFT, PolyPhen2, MutationTaster and other scores, and allele frequencies in dbSNP and the 1000 Genomes Project. Variant tools currently provides the following variant-based annotation databases:
evs was created from EVS on November 7, 2011 with approximately 2500 exomes; evs_5400 was created from EVS on December 15, 2011 with approximately 5400 exomes.
Position-based databases annotate chromosomal locations. Such databases provide annotation for all variants at a locus, mostly because there is no variant-specific information available. For example, the gwasCatalog database contains chromosomal locations of susceptibility loci of all published genome-wide association studies.
Range-based databases annotate regions of chromosomes, such as genes, exon regions of genes, and highly conserved regions. Variant tools provides the following range-based annotation databases:
Field-based annotation databases annotate variants indirectly through other variant info or annotation fields. For example, the keggPathway database lists the pathways that genes belong to, so it technically annotates gene IDs, not variants. To use this database, you first need to link the project to a database that provides the IDs of the genes each variant belongs to, and then link the keggPathway database to the project through gene ID. Therefore, a --linked_by field is required to use a field-based database. Variant tools provides the following field-based annotation databases:
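To make the four linking methods concrete, the sketch below matches one variant against four toy lookup structures, one per database type. All names, keys, and values are invented for illustration; the real databases are matched by variant tools itself and are far larger.

```python
# A variant is identified by (chr, pos, ref, alt).
variant = ("1", 1500, "A", "G")

# Variant-based: keyed by the full (chr, pos, ref, alt) tuple.
variant_db = {("1", 1500, "A", "G"): {"score": 0.02}}

# Position-based: keyed by (chr, pos); applies to any allele at that locus.
position_db = {("1", 1500): {"trait": "example trait"}}

# Range-based: (chr, start, end, annotation); ranges may overlap.
range_db = [
    ("1", 1000, 2000, {"gene": "GENE_A"}),
    ("1", 1400, 3000, {"gene": "GENE_B"}),
]

# Field-based: keyed by the value of another field (here a gene name).
field_db = {"GENE_A": ["pathway_1", "pathway_2"]}


def annotate_by_variant(v):
    return variant_db.get(v)


def annotate_by_position(v):
    return position_db.get((v[0], v[1]))


def annotate_by_range(v):
    # Every range on the same chromosome that contains the position.
    return [info for chrom, start, end, info in range_db
            if chrom == v[0] and start <= v[1] <= end]


def annotate_by_field(v):
    # Indirect: first obtain gene names from the range-based database,
    # then look those names up in the field-based database.
    genes = [info["gene"] for info in annotate_by_range(v)]
    return {gene: field_db.get(gene, []) for gene in genes}


print(annotate_by_variant(variant))
print(annotate_by_position(variant))
print(annotate_by_range(variant))
print(annotate_by_field(variant))
```

The field-based lookup only works after the range-based lookup has supplied gene names, which mirrors the requirement that a field-based database be linked through a field that some other linked database provides. The range-based lookup returns two entries here, one example of how a single variant can carry multiple annotation values.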
If you would like to use an annotation database that is not provided by variant tools, you can write a customized .ann file to create your own annotation database. This file tells variant tools the type of annotation database, the reference genome, the URL to source files, the version, and, more importantly, the type of each annotation field and how to extract it from source files. A large number of functors are provided in case you need to post-process text from input files to extract values of annotation fields.
Please cite
F. Anthony San Lucas, Gao Wang, Paul Scheet, and Bo Peng (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools, Bioinformatics 28 (3): 421-422.
if you find Variant Annotation Tools helpful and use it in your publication. Thank you.