Variant tools can import and export text files (or gzipped text files) in delimiter-separated format, namely records that are separated into columns by delimiters such as tab, space or comma. The file format must be variant-oriented (storing one or more variants by line), with the exception of sample-based PLINK format which is preprocessed internally into variant-oriented form for import. We use terms
chr
, pos
, ref
, alt
),A format specification defines how to extract fields from columns and how to write columns from fields, to
vtools import
. The text file should provide at least four fields for chromosome, position, reference and alternative alleles. In addition to variant and variant info, an input file can contain genotype and genotype info for one or more samples.The four fields could, but not necessarily have to, be in four columns because a field could be synthesized from multiple columns, even not from any existing column. For example, if all variants are from dbSNP, ref
and alt
fields could be looked up from a dbSNP annotation database.
Update existing variants with additional info using command vtools update
. The text files can have
full variant information (chr
, pos
, ref
and alt
) to update matching variants and related genotypes and their info, or
chromosomal positions (chr
and pos
) to update all variants at specified locations, or
range (chr
, start
, end
) to update variants in specified regions
Export samples and their variants using command vtools export
. The file format should be organized by variant (not by sample). Although each line of the output file usually present a variant (chr
, pos
, ref
, alt
), it can contain multiple variants at a locus (chr
, pos
, ref
, mutliple alternative alleles), or only locus-specific information (chr
, pos
).
.fmt
file)A .fmt
file describes the format of an external file so that it can be handled by variant tools. An format specification file follows an extended .ini
format with section and key/value pairs. Details of this format can be found at this link because it is parsed by a Python ConfigParser object. This file should have at least a [format description]
section and a few field
sections. A format specification could be used to import and export files but you can ignore import or export specifications if you are only interested in exporting or importing.
[format description]
This section should have keys
description
: a description of the format, with preferably a URL to the documentation of the format.
encoding
(optional): encoding of the input file. If you see errors such as "codec can't decode byte 0xe7 in position 21480"
, you should perhaps add encoding=ISO-8859-1
to your .fmt
file.
delimiter
(optional): delimiter to define columns in the input file. Default to tab (delimiter='\t'
). If your input can be either tab or space delimited, use delimiter=None
if your input does not have any missing field because '\t\t'
or ' '
will be treated as a single delimiter. Quotations around other delimiters can be ignored (e.g. ,
instead of ','
).
header
(optional): variant tools by default treats lines with a leading #
as header and skip them during data importing. The header will also be used to probe sample names for some formats (e.g. .vcf). If your input file does not use such a header, you can override this behavior by specifying the number of lines to skip (header=1
), or a pattern by which headers are matched (e.g. use header=^%.*
for header lines starting with %
, see documents of the Python re package for details).
preprocessor
(optional): a functor to pre-process input data into some intermediate format that can be readily imported by variant tools (e.g., PlinkConverter()
)
export_by
(optional, for export only): How variants at the same locus are exported. By default, variant tools exports one line for each variant. If fields specified by export_by
(e.g. chr
, pos
) cannot uniquely identify variants so multiple variants will be outputted in one line, values at extra records will be ignored, or be collapsed to the first record if supported by column specification.
sort_output_by
(optional, for export only): How variants are ordered. This option is ignored if export_by
is defined.
THIS OPTION IS CURRENTLY NOT USED
merge_by
(optional, for import only): If genotype at different homozygous copies (ploidy) is recorded in separate lines, this option has to be used to specify the columns to uniquely identify a variant. These columns will be passed to the processor without modification, values at all other columns are merged (separated by ‘,').
One of
variant
: fields for chr
, pos
, ref
, and alt
. Such input files have complete variant information and can be used to add new variants or add or update info of existing variants.
position
: fields for chr
and pos
for chromosomal positions.
range
: fields for chr
, start
and end
positions of regions.
input files of the latter two types can only be used to update matching variants with additional variant_info
.
variant_info
(optional): Additional fields that will be inserted to specified variant table.
genotype
(optional): name of a field that will be imported into the sample variant table. Using a slice-index syntax, this field can process genotypes for multiple samples (e.g. index=8:
will produce genotypes from columns 8, 9, …). The return value of this field must be value 1
for heterozygote, 2
for homozygote, and -1
for double heterozygote (alt1, alt2).
genotype_info
(optional): Fields such as quality score that will be appended to genotype of each sample. They should either generate a single value for all samples (regular index), or a different value for each sample (slice-index).
General speaking, genotype info fields are sample specific so different samples can have different values for the same variant. Variant info fields are supposed to be properties of variants and are shared by all (or part of the) samples. It is up to the user (and file format) to decide how to handle particular fields from input files.
[DEFAULT]
(define parameters of a format)The default section defines values that exist, conceptually, in all sections. It is used to define parameters used for import or export. These parameters could be overriden using corresponding command line option. For example, in the following .fmt file (partial)
[format description]
description=Output from ANNOVAR, generated from command "path/to/annovar/annotate_variation.pl annovar.txt path/to/annovar/humandb/". This format imports chr, pos, ref, alt and ANNOVAR annotations. For details please refer to http://www.openbioinformatics.org/annovar/annovar_gene.html
variant=chr, pos, ref, alt
variant_info=%(var_info)s
[DEFAULT]
var_info=mut_type
var_info_comment=Fields to be outputted, can be one or both of mut_type and function
The key var_info
has a help message defined by parameter var_info_comment
, and a default value mut_type
. If you query the details of this format using command
% vtools show format ANNOVAR_output
you will see at the end of the output the following description:
Other fields (usable through parameters):
function the gene name, the transcript identifier and the sequence change in
the corresponding transcript
Format parameters:
var_info Fields to be outputted, can be one or both of mut_type and function.
(default: mut_type)
This means you can pass an alternative value of var_info
to this format using parameters such as --var_info function
to change the variant information field to be imported or updated. For example, you can use command
% vtools update variant input_file --format ANNOVAR_output --var_info mut_type function
to update two fields mut_type
and function
instead of the default one (mut_type
).
Name of a variable cannot be any keyword (e.g. field
, comment
, adj
, type
) or start with fmt_
.
Import fields sections describe the fields that can be imported if they appear in one of variant
, position
, range
, variant_info
, genotype
and genotype_info
. Because the value of these fields could be overridden using corresponding parameters of command vtools import
, a .fmt
file might define additional or alternative fields to provide alternative ways to import data in this format.
Each field section can have the following keys:
index
: index(es) of field, which can be
A 1-based index of the column in the input file. The value at this column will be used for this field.
A tuple syntax with multiple indexes separated by ','
, for example 5,7
. A list of values at specified columns will be passed to an adjust function to produce values of the field.
A ‘slice’ syntax to specify multiple fields. This syntax will only be used for genotype and genotype information fields. For example, index=8:
will produce multiple fields from columns 8, 9, ...
till end of the columns. Other examples include index=8::2
for 8, 10, 12, ...
, index=5,8:
for (5,8)
, (5,9)
, …, and index=8::2,9::2
for (8,9)
, (10,11)
, etc
type
: can be any SQL allowed types such as “INTEGER”, “FLOAT”, “INTEGER NOT NULL”, or “DECIMAL(7,6) NOT NULL”
comment
(optional) a description of the field
adj
(optional): Functions or functors to extract or modify one or more values from the field value. variant tools provides a number of functors to help the processing of different fields. For example,
IncreaseBy(inc=1)
(increase value by inc
). This is usually used to adjust 0-based position to 1-based position that is used by variant tools.
MapValue(map)
(map value to another). This function is used to map input value to another one, for example, MapValue({'het': '1', 'hom': '2'})
maps value ‘het’ to ‘1’ and ‘hom’ to ‘2’. Note that the mapped values should be strings or None
.
Nullify(val)
(treat value as NULL). This is usually used to adjust input NA
values to NULL
in the database.
RemoveLeading(val)
(remove leading characters). This is usually used to remove leading chr
from inputs with chromosomes such as chr1
, chr2
, because variant tools only stores 1
, 2
, etc.
ExtractField(index, sep=';', default=None)
(field of a column): Split the item by sep
and return the index
-th field (1-based). Return default
if there are less than index
fields. For example, ExtractField(2, ':')
extracts 20
from 10:20:30
.
Multiple functors/functions can be used sequentially. For example, ExtractValue('CEU:'), ExtractField(1, '/')
extracts value 12
from YRI:20/222;CEU:12/45
(the first extractor gets 12/45
, the second extractor gets 12
), and ExtractValue('CEU:'), SplitField('/')
extracts values 12
and 45
from YRI:20/222;CEU:12/45
.
If none of the provided extractors can be used, you can define your own function. For example, lambda x: x[6:].replace(",", "")
extracts 24000
from COUNT=24,000
. You can also mix function with variant tools provided extractor ExtractValue("COUNT="), lambda x: x.replace(",", "")])
to extract 24000
from values such as ASD;COUNT=24,000;MAF=0.5
.
If the return value of at least one of the fields is a tuple of more than one values (e.g. result from functor SplitField
), multiple records will be generated from this line. Non-tuple values of other fields will be copied automatically. For example, if three fields have values A1
, (B1,)
, (C1, C2)
, they will produce two records of A1, B1, C1
and A1, None, C2
.
If you would like to exclude certain records, you can use adj
to produce invalid records that will be excluded during importing. For example, a record with None
alternative allele, or a field with NOT NULL
type will be ignored, or a genotype with None
mutation type will be ignored.
[Field formatter]
(export: how to format fields for output)This section specifies how to format fields when they are outputted to a file. Fields that are not listed in this section will be converted to string directly, unless a special wildcast formatter fmt_*
is specified. This section should look like
[Field formatter]
fmt_chr=Prepend('chr')
fmt_freq=Formatter('{.3f}')
fmt_id=ValueOfNull('.')
fmt_other=lambda x: something
The name of the formatter should be field name prepended by fmt_
. Formatter for each field should be a functor or a function. Their return value must be a string. More interestingly, multiple fields could be formatted altogether so it is possible to specify
fmt_PL1,fmt_PL2,fmt_PL3=lambda x: x[0] + x[1] + x[2]
to create a single string from multiple fields.
Export reverses the import process. Instead of extracting fields from one or more columns, it generates columns from one or more fields. Column sections have title col_#
where #
is the index of column. They are specified in a similar way to fields. File formats without column specification cannot be used to export variants and samples.
Each column section can have the following keys:
field
: name of field or fields that will be used to create a column. The fields are usually defined in this .fmt file. If a field specifies genotype or genotype information of more than one sample using a slice syntax, multiple columns will be generated. If multiple records are collapsed into a single record(c.f. export_by
), a tuple of values will be passed to the adj
function/functor if it is defined. Otherwise, only the first record will be exported.
adj
: function or functor that accepts values at one or more fields and produce a single value at this column.
comment
(optional) a description of the column
The basic steps of outputting a file is to collect values of all fields, apply formatters to each field (or groups of fields) if available, and collect results to columns. Note that
Although fields of columns are usually the ones that are defined in a .fmt file, arbitrary fields could be outputted if fields of a column are configurable through parameters.
A column is considered as a genotype column if it contains field GT
. Multiple columns will be exported if genotypes of multiple samples are outputted.
A field can generate multiple columns by using a formatter that returns, for example, ‘1,2’ for a comma-separated format.
After a file is created, you can use command
vtools import --format /path/to/fmt_file input_file --build specified_build
to import your file, and use command
vtools show table variant -l -1
to check if variants are correctly imported. If you believe that your format is commonly used, please send the .fmt
file to our mailinglist. We will post the file to the repository so that others can make use of it.
You can learn from the system format files on how to define a format:
You can have a look at system format files using
vtools show formats
vtools show format SOME_FORMAT
more cache/SOME_FORMAT.fmt
The first command gets a list of supported formats. The second command gets details of a specific format, and the third command lets you view the format file because all used formats are stored in a temporary directory named cache
under your project folder.