Samples from two patients were sent to Complete Genomics and Illumina for whole-genome sequencing analysis. The variants are called by Complete Genomics using Complete Genomics Anslysis (CGA) package, and CASAVA 1.8 by Illumina. Before we do any further analysis, we are interested in knowing whether or not these two platforms yield comparable results.
As always, we need to find an empty directory and create a new project:
% vtools init comparison
For each sample, the variants are saved in a single file ASM/masterVarBeta-$ID-ASM.tsv.bz2
, which can be imported directly using format CGA
.
% vtools import --format CGA /path/to/ASM/masterVarBeta-CG1-ASM.tsv.bz2 \
-j2 --sample_name CG1 --build hg19
% vtools import --format CGA /path/to/ASM/masterVarBeta-CG2-ASM.tsv.bz2 \
-j2 --sample_name CG2
SNVs and INDELs are called separately and are saved by chromosomes. They can be imported using format CASAVA18_snp
and CASAVA_indel
.
% vtools import --format CASAVA18_snp /path/to/S1/Variations/snps/chr*.txt \
--sample_name Illumina1
% vtools import --format CASAVA18_snp /path/to/S2/Variations/snps/chr*.txt \
--sample_name Illumina2
% vtools import --format CASAVA18_indel /path/to/S1/Variations/indels/chr*.txt \
--sample_name Illumina1
% vtools import --format CASAVA18_indel /path/to/S2/Variations/indels/chr*.txt \
--sample_name Illumina2
Although each genotypes from each sample consist of multiple samples (in variant tools term), they can be identified by their names.
Use command vtools show samples
to see samples and their phenotypes.
All samples are imported to the master variant table so we need to separate them into separate variant tables according their sample names. This can be done using commands
% vtools select variant --samples 'sample_name="CG1"' -t CG1
% vtools select variant --samples 'sample_name="CG2"' -t CG2
% vtools select variant --samples 'sample_name="Illumina1"' -t Illumina1
% vtools select variant --samples 'sample_name="Illumina2"' -t Illumina2
We can use vtools_report variant_stat
to calculate number of different types of variants in each table, or we can do that explicitly using commands
% for table in CG1 CG2 Illumina1 Illumina2
% do
vtools select \\(table 'alt!="-"' 'ref!="-"' 'length(alt)=1' 'length(ref)=1' -c
vtools select \\(table 'ref="-"' -c
vtools select \\(table 'alt="-"' -c
vtools select \\(table 'alt!="-"' 'ref!="-"' 'length(alt) > 1 OR length(ref) > 1' -c
% done
Of course, for the total number of variants in each table, you can simply call
% vtools show tables
To find common sets of variants, we can use command vtools compare
to compare tables:
% vtools compare CG1 CG2 -c
% vtools compare Illumina1 Ilumina2 -c
% vtools compare CG1 Illumina1 -c
% vtools compare CG2 Illumina2 -c
The first two commands compare two samples called using the same platform, the latter two commands compare variants from the same sample, called by different platforms. If you would like to know the exact differences, you can run commands such as
% vtools compare CG1 Illumina1 --difference CG_not_illumina
% vtools compare Illumina1 CG1 --difference Illumina_not_CG
to get lists of variants differentially called by two platforms.
Different quality scores are used for different platforms and variant types. The following command calculates the average quality scores per-sample, and adds six phenotypes to the sample table.
% vtools phenotype --from_stat 'mean_Q_indel=avg(Q_indel)' 'mean_Q_snv=avg(Q_max_gt)' \
'mean_VAF1=avg(allele1VarScoreVAF)' 'mean_VAF2=avg(allele2VarScoreVAF)' \
'mean_EAF1=avg(allele1VarScoreEAF)' 'mean_EAF2=avg(allele2VarScoreEAF)'