We have whole genome sequencing data for more than 18M variants. After some initial analysis, three genes geneA
, geneB
, and geneC
caught our eyes and we would like to further investigate these genes. We assume that we already have a project with all the variants imported
We first need to get a list of genes
% vtools use refGene
From the output
% vtools show annotation refGene
we can see that it has locations of each gene, gene ID (NM_XXX
) and a more common name name2
. To select variants that belong to these genes, we can do
% vtools select variant 'refGene.name2 in ("geneA", "geneB", "geneC")' -t gene_ABC
Or, if you prefer separating the tables, you can use commands
% for gene in geneA geneB geneC
% do
vtools select variant "refGene.name2='$gene'" -t \\(gene
% done
to create three tables. To create another table with variants in these three tables, you could do
% vtools compare geneA geneB --A_or_B gene_AB
% vtools compare gene_AB geneC --A_or_B gene_ABC
% vtools remove tables gene_AB
The last command remove the intermediate table gene_AB
.
To subset a dataset to variants within a group of hundreds of genes, you need to create a small annotation database to link the gene list to the project, via, e.g., the refGene database. Suppose your list of genes are formatted like:
% NEBL NM_213569
% NEBL NM_213569
% NEBL NM_213569
% NEBL NM_213569
% LINC00167 NR_024233
% PGAP2 NR_027016
% PGAP2 NR_027016
...
where the columns are gene name and (or) transcript name. You can create an annotation file for the gene list, e.g.,
[linked fields]
*=gene_name
[data sources]
description=My gene selection for xxx project
version=20130419
anno_type=field
source_type=txt
[gene_name]
index=1
type=VARCHAR(255)
comment=Gene name
[transcript_name]
index=2
type=VARCHAR(255)
comment=Transcript name
and the annotation database
% vtools use MyGenes.ann -f my_gene_list.txt --linked_by refGene.name2
Note that the default linked field is “gene_name”, the gene names. Alternatively you can choose to link by transcript name, e.g.
% vtools use MyGenes.ann -f my_gene_list.txt --linked_by refGene.name --linked_fields transcript_name
Finally select the variants belonging to this list of gene
% vtools select some_table "MyGenes.gene_name is not NULL" -t my_genes
Annotation database refGene_exon
lists all exonic regions of each gene. To use this annotation database, you can download and use the latest version of this database using command
% vtools use refGene_exon
Then select variants that belong to the exonic regions of the genes using command
% vtools select variant 'refGene_exon.name2 in ("geneA", "geneB", "geneC")' -t gene_ABC
Or separately for each gene:
% for gene in geneA geneB geneC
% do
vtools select variant "refGene_exon.name2='$gene'" -t \\({gene}_exon
% done
If you would like to find all variants that are in vicinity of genes (e.g. 2k basepair in the upstream and downstream of the gene), you will have to know the position of the genes. You could get such information from UCSC database, or from the refGene database using commands such as
% vtools execute 'SELECT chr, txStart, txEnd FROM refGene.refGene WHERE refGene.name2="geneA"'
This query is database dependent. You could use command sqlite3 annotation.DB .schame
to get the structure of database if you would like to know the structure of another annotation database and submit a similar query.
After you get the starting and ending locations of the genes, you could select a variants using commands such as
% vtools select variant 'chr="5"' 'pos>=7152445' 'pos<=7158882' -t geneA_ext