When vtools import
imports a file, it creates one or more samples associated with this file. This works well if all genotypes belong to a sample are contained in one file, but not so if the genotypes are scattered in multiple files (e.g. chromosome by chromosome). This tutorial demonstrates how to use vtools admin
command to handle sample genotypes that are imported from multiple files.
We import genotypes for two samples, with names MG1000
and MG1004
. SNV and Indels are imported from different files but are imported with the same sample names.
vtools init proj
vtools import snv/MG1000-240.snp.txt.vcf --build hg18 --sample_name MG1000
vtools import snv/MG1004-200.snp.txt.vcf --sample_name MG1004
vtools import indel/MG1000-240.pileup.indel --format pileup_indel --sample_name MG1000
vtools import indel/MG1004-200.pileup.indel --format pileup_indel --sample_name MG1004
variant tools treats each line in the sample table as a separate sample, even if some of them share the same sample names.
when we list the samples in this project, we can see four samples:
vtools show samples
sample_name filename
MG1000 snv/MG1000-240.snp.txt.vcf
MG1000 indel/MG1000-240.pileup.indel
MG1004 snv/MG1004-200.snp.txt.vcf
MG1004 indel/MG1004-200.pileup.indel
If we try to, for example, count the number of variants in each sample, we can use the vtools phenotype
command:
vtools phenotype --from_stat 'het=#(het)' 'hom=#(hom)' 'GT=#(GT)'
Calculating phenotype: 100% [==========================================>] 4 0.4/s in 00:00:09
INFO: 12 values of 3 phenotypes (3 new, 0 existing) of 4 samples are updated.
The statistics (phenotypes) are available in the sample table
vtools show samples
sample_name filename het hom GT
MG1000 snv/MG1000-240.snp.txt.vcf 2384712 1319373 3708513
MG1000 indel/MG1000-240.pileup.indel 623571 224378 847949
MG1004 snv/MG1004-200.snp.txt.vcf 2372093 1328117 3704532
MG1004 indel/MG1004-200.pileup.indel 605539 231405 836944
which can also be displayed using command vtools phenotype --output
if you are only interested in a subset of phenotypes:
vtools phenotype --output sample_name hom het GT
sample_name filename het hom GT
MG1000 1319373 2384712 3708513
MG1004 1328117 2372093 3704532
MG1000 224378 623571 847949
MG1004 231405 605539 836944
The problem is that genotypes imported from snv
and indel
folders belong to the same physical samples. Although it is useful to know the number of SNVs and Indels separately, genotypes belongs to the same physical sample should better be treated together because
handling of phenotype: Phenotypes such as blood pressure, sex, and weight are defined for physical samples. Although command vtools phenotype
allows you to import phenotype by sample name and will automatically replicate phenotypes to all samples with the same sample name, it is confusing to have several ““samples”” with the same phenotypes.
association analysis: Association analysis are based on samples and sometimes use sample size for its analysis. Spreading genotypes that belong to the same physical sample into several samples will lead to erroneous results.
export variants: variants are exported by samples. SNVs and indels will be exported as different samples in this example.
Because samples are merged by their names, it is important to double check the names of samples before you merge samples. Because samples are imported with correct names in this tutorial, we do not have to rename samples. Otherwise, you might have to rename samples using command vtools admin --rename_samples
. For example, assuming that we did not use any of the --sample_name
option during import, we would get a sample table as follows:
sample_name filename het hom GT
SAMPLE snv/MG1000-240.snp.txt.vcf 2384712 1319373 3708513
None indel/MG1000-240.pileup.indel 623571 224378 847949
SAMPLE snv/MG1004-200.snp.txt.vcf 2372093 1328117 3704532
None indel/MG1004-200.pileup.indel 605539 231405 836944
Samples imported from vcf files have a dummy sample name obtained from the header of the vcf files, others have None
because the pileup.indel format does not have a header. If you merge samples, SNVs and Indels would be merged together as samples SAMPLE
and None
, which is definitely not we want.
vtools admin --merge_samples
are used to merge genotypes of the same physical samples. If samples share genotypes for the same variants, this command will fail with an error message.
To correct this problem, we can run
vtools admin --rename_samples 'filename like "%MG1000%"' MG1000
vtools admin --rename_samples 'filename like "%MG1004%"' MG1004
After we make sure that samples to be merged share the same sample names, we can merge samples using command vtools admin --merge_samples
:
vtools admin --merge_samples
INFO: 4 samples that share identical names are merged to 2 samples
Merging MG1004: 100% [=====================================================>] 5 2.2/s in 00:00:02
Merging MG1000: 100% [=====================================================>] 5 2.1/s in 00:00:02
The new sample table looks like:
vtools show samples
sample_name filename het hom GT
MG1000 snv/MG1000-240.snp.txt.vcf,indel/MG1000-240.pileup.indel 2384712 1319373 3708513
MG1004 snv/MG1004-200.snp.txt.vcf,indel/MG1004-200.pileup.indel 2372093 1328117 3704532
Merging samples will merge genotypes from original samples. Source filenames will also be updated. Phenotypes, if any, are copied from one of the original samples. The number of homozygotes, heterozygotes, and genotypes are therefore wrong in the above sample table and we need to update them by running vtools phenotype --from_stat
again.
vtools phenotype --from_stat 'het=#(het)' 'hom=#(hom)' 'GT=#(GT)'
Calculating phenotype: 100% [=========================================>] 2 0.5/s in 00:00:04
INFO: 6 values of 3 phenotypes (0 new, 3 existing) of 2 samples are updated.
The statistics (phenotypes) are now
vtools show samples
sample_name filename het hom GT
MG1000 snv/MG1000-240.snp.txt.vcf,indel/MG1000-240.pileup.indel 3008283 1543751 4556462
MG1004 snv/MG1004-200.snp.txt.vcf,indel/MG1004-200.pileup.indel 2977632 1559522 4541476