% vtools_report sequence -h
usage: vtools_report sequence [-h] [--build [BUILD]]
[--numbered [{left,right}]]
[--char_per_line CHAR_PER_LINE]
[--transcribe [GENE [GENE ...]]]
[--translate [GENE [GENE ...]]]
[--mark [MARK [MARK ...]]] [--mark_complement]
[--mark_reverse] [--hide_unmatched]
[-v {0,1,2,3}]
regions [regions ...]
positional arguments:
regions One or more chromosome regions in the format of chr
:start-end (e.g. chr21:33,031,597-33,041,570),
Field:Value from a region-based annotation database
(e.g. refGene.name2:TRIM2 or
refGene_exon.name:NM_000947), or set options of
several regions (&, |, -, and ^ for intersection,
union, difference, and symmetric difference). Several
regions will be printed if the name matches more than
one regions. Chromosome positions are 1-based and are
inclusive at both ends so the chromosome region has a
length of end-start+1 bp. A reversed complementary
sequence will be outputted if starting position is
after the ending position.
optional arguments:
-h, --help show this help message and exit
--build [BUILD] Output sequence at specified build of reference
genome. The primary reference genome of the project
will be used if by default.
--numbered [{left,right}]
If specified, add position of the first or last
basepair of each line to the left or right of the
line, and insert a space at every 10 basepair
--char_per_line CHAR_PER_LINE
Number of characters (excluding space and leading
numbers) per line. Default to 70 in regular and 60 in
numbered format.
--transcribe [GENE [GENE ...]]
Transcribe DNA sequence into RNA sequence. variant
tools will look in the refGene database, find all
genes that overlap with the region, locate exons and
coding regions, transcribe the DNA sequence to RNA
sequence (basically discard introns, and complement if
on reverse strand). The complete mRNA sequence will be
printed regardless of the bounaries of specified
regions. If one or more names (refGene.name) are
specified, only specified genes will be translated.
--translate [GENE [GENE ...]]
Translate DNA sequence into protein sequence. variant
tools will look in the refGene database, find all
genes that overlap with the region, locate exons and
coding regions, transcribe and translate the DNA
sequence to protein sequence. The complete protein
sequence will be printed regardless of the boundaries
of specified regions. If one or more names
(refGene.name) are specified, only specified genes
will be translated.
--mark [MARK [MARK ...]]
Mark a location (--mark chr pos), a variant (--mark
chr pos ref alt), a region (e.g.
refGene_exon.name:NM_000947), or a particular sequence
(e.g. TCGGA) in red in the output. If a variant is
specified, the changed nucleotide or amino acid will
be printed. Currently only single nucleotide
polymorphisms are supported.
--mark_complement If set, also try to mark the complement of the
specified sequence
--mark_reverse If set, also try to mark the reverse sequence of the
specified sequence. If both mark_complemnt and
mark_reverse are set, four different sequences will be
searched.
--hide_unmatched If set, only display regions with marked variants or
sequences
-v {0,1,2,3}, --verbosity {0,1,2,3}
Output error and warning (0), info (1), debug (2) and
trace (3) information to standard output (default to
1).
% vtools init test
% vtools_report sequence chr6:123456-123743 --build hg19
>ref|hg19|chr6:123456-123743
TTCCTGGAGTATTTTTCCCCTGACAAATTACTTATCATCTATCATAATTCAGGTTAAATGGCACTAACTC
AGGGAAGGCTTCCCTAACTGCCTCCCTTCTCCAACCAAATTAGGAACAATTATATGGCCACATAGTATCG
AATCAAGTTTATAATTTTAAAATAATTGGGAGATTTTGTTGTTTAACACTTGTTTTCACTATAAGACTGT
AATTACATGCAAGTAAGAACCATGCCTGTTTGTTCACTCCTGCCACAGTCAGAATAGTGCCTGGAATATG
CAGTAAGG
Here we need to explicitly specify a reference genome because the project is empty and does not have a primary reference genome.
You can also output multiple sequences, or sequences in reverse order (on the - strand) if you specify a starting position that is larger than the ending position.
% vtools_report sequence chr6:123456-123743 chr6:123743-123456 --build hg19
>ref|hg19|chr6:123456-123743
TTCCTGGAGTATTTTTCCCCTGACAAATTACTTATCATCTATCATAATTCAGGTTAAATGGCACTAACTC
AGGGAAGGCTTCCCTAACTGCCTCCCTTCTCCAACCAAATTAGGAACAATTATATGGCCACATAGTATCG
AATCAAGTTTATAATTTTAAAATAATTGGGAGATTTTGTTGTTTAACACTTGTTTTCACTATAAGACTGT
AATTACATGCAAGTAAGAACCATGCCTGTTTGTTCACTCCTGCCACAGTCAGAATAGTGCCTGGAATATG
CAGTAAGG
>ref|hg19|chr6:123456-123743 (reversed)
GGAATGACGTATAAGGTCCGTGATAAGACTGACACCGTCCTCACTTGTTTGTCCGTACCAAGAATGAACG
TACATTAATGTCAGAATATCACTTTTGTTCACAATTTGTTGTTTTAGAGGGTTAATAAAATTTTAATATT
TGAACTAAGCTATGATACACCGGTATATTAACAAGGATTAAACCAACCTCTTCCCTCCGTCAATCCCTTC
GGAAGGGACTCAATCACGGTAAATTGGACTTAATACTATCTACTATTCATTAAACAGTCCCCTTTTTATG
AGGTCCTT
A non-standard, but easier-to-read numbered format could be outputted using option --numbered
:
% vtools_report sequence chr6:123456-123743 chr6:123743-123456 --numbered --build hg19
>ref|hg19|chr6:123456-123743
123456 TTCCTGGAGT ATTTTTCCCC TGACAAATTA CTTATCATCT ATCATAATTC AGGTTAAATG
123516 GCACTAACTC AGGGAAGGCT TCCCTAACTG CCTCCCTTCT CCAACCAAAT TAGGAACAAT
123576 TATATGGCCA CATAGTATCG AATCAAGTTT ATAATTTTAA AATAATTGGG AGATTTTGTT
123636 GTTTAACACT TGTTTTCACT ATAAGACTGT AATTACATGC AAGTAAGAAC CATGCCTGTT
123696 TGTTCACTCC TGCCACAGTC AGAATAGTGC CTGGAATATG CAGTAAGG
>ref|hg19|chr6:123456-123743 (reversed)
123743 GGAATGACGT ATAAGGTCCG TGATAAGACT GACACCGTCC TCACTTGTTT GTCCGTACCA
123683 AGAATGAACG TACATTAATG TCAGAATATC ACTTTTGTTC ACAATTTGTT GTTTTAGAGG
123623 GTTAATAAAA TTTTAATATT TGAACTAAGC TATGATACAC CGGTATATTA ACAAGGATTA
123563 AACCAACCTC TTCCCTCCGT CAATCCCTTC GGAAGGGACT CAATCACGGT AAATTGGACT
123503 TAATACTATC TACTATTCAT TAAACAGTCC CCTTTTTATG AGGTCCTT
The default value for parameter --numbered
is left
. You can also put the numbers to the right (referring to the 1-based position of the last base pair of each line) using option --numbered right
,
% vtools_report sequence chr6:123456-123743 chr6:123743-123456 --numbered right --build hg19
>ref|hg19|chr6:123456-123743
TTCCTGGAGT ATTTTTCCCC TGACAAATTA CTTATCATCT ATCATAATTC AGGTTAAATG 123515
GCACTAACTC AGGGAAGGCT TCCCTAACTG CCTCCCTTCT CCAACCAAAT TAGGAACAAT 123575
TATATGGCCA CATAGTATCG AATCAAGTTT ATAATTTTAA AATAATTGGG AGATTTTGTT 123635
GTTTAACACT TGTTTTCACT ATAAGACTGT AATTACATGC AAGTAAGAAC CATGCCTGTT 123695
TGTTCACTCC TGCCACAGTC AGAATAGTGC CTGGAATATG CAGTAAGG 123743
>ref|hg19|chr6:123456-123743 (reversed)
GGAATGACGT ATAAGGTCCG TGATAAGACT GACACCGTCC TCACTTGTTT GTCCGTACCA 123684
AGAATGAACG TACATTAATG TCAGAATATC ACTTTTGTTC ACAATTTGTT GTTTTAGAGG 123624
GTTAATAAAA TTTTAATATT TGAACTAAGC TATGATACAC CGGTATATTA ACAAGGATTA 123564
AACCAACCTC TTCCCTCCGT CAATCCCTTC GGAAGGGACT CAATCACGGT AAATTGGACT 123504
TAATACTATC TACTATTCAT TAAACAGTCC CCTTTTTATG AGGTCCTT 123456
The regions could also be specified using field_name:value
of a linked annotation database. For example, if you have used annotation database refGene_exon
in your project, you could list the sequence of all exons in a gene using command
% vtools use refGene_exon
% vtools_report sequence refGene_exon.name2:ABRA -v0
>ref|hg19|chr8:107773742-107771711 (refGene_exon.name2:ABRA 1)
TACAAGAAAACATCTTTTATTCTTATGCTATTTGTCTTTAACATTTGGCGTAGGTTTCACTTGTAATAGA
AGTGTGCATCTTACATGAAATCCTACTTAGAAATATTGGTCATTAACTTATAGCTTTATAAATACGGTAT
GGAAACAAAATACGCATACTTTCTTAAGATACCAGAACCTTCAGTTTGAAAGAGACTCATACATGAAAAC
AAATATAATTCTTATAGCAGTCTAATACCTATTATTTTGAGGTATCATATTTAGACATTAATTTTTACAT
TTAACACCAAAGTGCTATACTATCAATGCATTTTAGTAATTTCATTGAATCCAGTAAGAATTGCAATCTC
ACATTCCATTCCATTTGTTTTTAAGATCACATAGGCACATTAATGTCCAATGATTTTCTCCTTGCCATCA
TCATTGTATATTTTTTACTTTGGCAGAACTAATATTGCAGTTTTATGACTTAAACATTTCATAAAGTTTT
CTTTGTTACAGTATGAGGAACATTAAACTGTCTCAAATCTGTTTTTAATTCACCAATGATTCTATGGGAT
TTTTCTAATACTGAACATTCCACTTCATAGATTTTCTGTATTTCTCCATTGTTGGAGAAATACATATCAT
TGCATTGATGTTAATGCAAGCAGCTTGACTCGTTATATGCTGACATGTAATGAAACAATAAACTTCTCCA
AGAAGTCCTACAAAGGCAAGTAAATCTGAATCTCTCACCTCTGCTCTACTTTTAGTGAGGGTGTGCCTTC
CTTTGTTGGTACTAAGACAGGAAGTGTCTGTTTTTATGTGATTTGCTGAAATCAAACAAGAGGAAGGTGA
ACTCTTACTGGAAACCTGTTAGAATCCAAGCCCACAGAGGAACTTCAACAATTTCAGCAACATCTTGACA
ACTTTTTATATATGTGCTTGTTTTGTAAGTGATGCTATGATATAAACAATAATACAAAGTCAAGCTTCTG
AAAAAATTTTAAGTTTAGTTTTACCAAACTTAATGGTATTTTCTTTTTTCCAAATAAAGACACATTATTA
TATGTTCTACTGATATTTATCTCAAGTCAGCATTCTTCTTAGATGTAATGTTTAATTAAAAATCTATAGT
AGTTTCTACAATGACTAGCTAAATATTTTGCCTAAAATATACAAGTGGCTGGGTGCTGTTGCTCATGCCT
ATAATCCCAGCTACTTGGGAGGCTGAGGTGAGAGGATCACTTGAGGCCAGGAGTTAAAAGACCAGCTTTG
GTGACATAAAGAGAATCTGTCTCAAAACAAAAACAAAAACACCCTGTGGAATTTCAACTATTAATACTAT
TATTATACATTAACATTCTATGTGCTTTCTGAAATGCTTTGTGCCTTCTCAAAATCTCCAAAATTCCTCA
TTCTAGATAGAAATAGAATGCCAGAAAGTCGTTTATGTAAAAATTGTTTAATATTTACTAACTTATTACA
GAGAGTTTGCATTTTCATTTTACCTACATGAGCATTTAGCATTAAGACCATAGTGGGCCAAATTTGGCTT
TTGTTTTTGAAGGTTCACTTGAGTAGCGTAATCACAACATGGTCATCTCGGCCTTGCCATAGCATCTCTC
CTTCAAAGTCTACCAGTCCATGTTTCCTGGCACGCATGAGAATGCCCACTACTTTATCTGAAATACGAAC
GTATCTGTCAAAGAGATCTCCAAAAGTAACCTGGATCTTGCCATCTCGTCTGTGGCGAGCCATTGTGCAG
ATAATGAAGCACATGTCCATCATTTCCCTGTAGATGTGCTCCTCAGCACGCTTGGCCCTTTCAGCAGTTT
TGGTTCCTTCTTTGGGGCGGCCATAGCCCTCATCTCCTTTGTGTAGGCGGGTGGACATGGCCAGCTCGTA
ATCAAACTCTTCACTGAAAGGATTGAGCTTCTGGGATTGTATGTGTTCATCAGCCCACTGCTGCCATCTC
CCTTTCAAGTTGCCCACTGGGCTATATTTCTGTTGGGCTTTGCAGTTGAGTTTCTCTGTAAATCTGTTTA
CC
>ref|hg19|chr8:107782472-107781751 (refGene_exon.name2:ABRA 2)
TGGGAGGGCAAGGGGCGCTTGATCCTGACCACAGCCACCTGCACTCCATCCTGCTCGGGCCTCTCCTCAG
CCTCTCCTCCATAGCCGCTGTCCTCTGTGTCTACGCTGTCACTCCTCCATGTGGGCTCCTCCTGCTCCAT
CACTCTCCAGCCCTTGGTTAGCTCAGACACCAGGTTGGCACATTTTCTCCTCCGCGTTGGGGAGCCGTGG
CTGTGGAGGATTCTGTCAATGTCATTCTCTGGCTGCCCAGGTTCAAGCACACCAGCATCCCTCTCGTACC
TGTGGCTGAGGTGGCTCACGTCCCCTCCTCTCTCATAAGTCTTGCTGACCACCGTTTTGGACACCTCTTT
CTTTTTGATGTGAGAAACCTCAGGGGCTTTCTCTGAGCTTTGTCCATCTCCATGTCCTTCTGGCAGGCGG
GGTGGCGACTTTGGGGCACTCTGAGCTTTCTGGTGTGAAGTAGGGGGTGTGATTGGTTTAGGAGCTTGAG
GTGAGTCCTGGGTCCCTCCCGGCAGCCAGCCTGTAGGCTCCTGGGCCTGCCTGATGCTGTTCTCATTCGC
CCACTGCTGCCAACCTCGGGCCAAGCTGATGACCAGGGTGGCTGTGCGTATCTTCCGGAGGGCGCTCTTG
GCTGGGCCCTCCCCGCTTTCCTTTTCGCCCGGAGCCATGCTGCCCACCTGTCTTTCTCTGCTGATAGCCT
GGACACTGGCTAAAATGAGTGT
You can specify more complex regions using set options between regions (e.g. refGene.name2:ABRA & chr8:107782472-1077800
). For example, the following command lists sequences of introns of gene BRCA1
.
% vtools use refGene
% vtools use refGene_exon
% vtools_report sequence 'refGene.name2:BRCA1-refGene_exon.name2:BRCA1'
>ref|hg19|chr17:41197820-41199659
CTGGAGACAGAGAACACAAGCAGAGATTAGTGTCAATTCATTCTCCTGGACTAGGCTCTAATCAATCGAC
TCCAGGGTCCTGGTTGTATGAGTTCTTAGGATTAATGAGGTAGAAGCTAATTTTTTTTTTTTTTTTTTGA
GACGGAGTCTTGCTCTGTCGCCGAGGCTAGAGTGTGATGGCGCAATCTCGGCTCATTCAACCTCCGCCTC
CTGGGTTCAAGCAATTCTCCTGTCTCTGCCTCCTGAGTAGCTGGAATTACAGGCACATGCCATCACACCC
AGCTAATTTTTGTATTTTTAGTAGAGACGGGGGTTTCACAATGTTGGCCAGGCTGCTCTGGAACTCCTGA
CCTCAGGTGATCCACCCACCTTGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACTGCACCTGGCC
TTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTTGCTCTTGTTGCTCAGCCTGGAATGCAATGGCA
CGATCTCAGCTCACTGCAACCTCCACCTCCCGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCCAAGTAGC
AGGGATTACAGGTGCCTGCCACCATGCCAGGCTAATTGTTTTTTCTTTTTTTTCAGATGGAGTCTCACTC
TGTCACTCAGGCTGGATTGTGATGGTGTGATCTCAGCTCACTGCAACCTCAACATCCTGGGTTCAAGCGA
TTCTCCTGCCTCAGTCTCCCAAGTAGCTGGGACTACAAGTGCGTGCCACCATGCCTGGCTAATTTTTTTT
AGTATTTTTAGTAGAGATGGGGTTTCGCCATATTGGCCAGGCTGGTCTCAAACTCCTGATGTCAGGTGAT
CCGCCCTGAGGCTGAGGCAGGAGAATCATTTAAACCCAGGAGGCGGAGGTTGCAGTGAGCCAAGACTGGG
CCACTGCACTCCAGCCTGCTAAGTGACAGAGTGAGACTCCACCTCAAAAAAAAAAAAAAAAGGCAATGCT
TCAGGACATAAGGCCTTGCTCTGAAGAGGCCCTAGGAGTGACTCCTGGTGACAGTGAAAGCCCACAGCCT
CTGGCAACTGTATTAACATGAACTTCAATCTGTTAAAGGAAAGCCACCAGGAAAACAGCACTGTAATTTA
ACGATGTGGAAAAATGTATGTAATATCTTAAGGAAAAAAGCAAAACAGTGTAATTATGATCACATTTTAT
AAAATACACGTGTATATATACGCACATATGCCTGGTGGAGTTTTATGGTGATCATCTCCAAGTGGTGGAA
TTACTGGGATTATTTTATTGTTTTTGTGTAAATTTATACTTTCTTTTTTCTTTTTGAGACACGGTCTCGC
TCTGTCGCCCAGGCTGGAGTACAGTGGTGTGATCGTGGCTCACTGAAGCATCAACCTCCTGAGCTCAAGT
GATCCTCCCACCTCAGCTTCCCAAGTAGCTGCGACTACAGGCATCTGCCACCACACCCAGCTACTTTTTA
AATTTGTTGTACAGATGAAGTCTCCTTATGTTGCCCAGGCTGGTCTCGAACTTCTAGGCTCCCACCTTGA
CCTCCATCTTGACCTCCCAAAGTGCTGGAATTATAGGCATGAGCCACCATGCCCGGCCTTGATTTATGTT
TTTGTGATGAACATTCATATCTTACTCCCACCCCATGGAAACAGTTCATGTATTACTTTTACAATATAAA
ACAAATAACAATAAAAACATCAAAAAGACATTTTAGCCATTCATTCAACAAATATTTAAAATGTGCCAAG
AACTGTGCTACTCAAGCACCAGGTAATGAGTGATAAACCAAACCCATGCAAAAGGACCCCATATAGCACA
GGTACATGCAGGCACCTTAC
>ref|hg19|chr17:41199721-41201137
CTGGATCCCCAGGAAGGAAAGAGCATTCAAAGTGTCAAAGTAGGACTACTGGAACTGTCACTTCATCATT
TTTTTTGTTTGTTTTTGAGACAGGGTCTTGCTCTGTCACCCAGGCTGGAGTGCAGTGGTGTGATCTCAGC
TCACTGCCACCTCTGCCTCCTGGGCTCAAGCAATCCTTCCATCTCAGCCTCCTAAGTAGCTGGAACTACA
GACACGTACCACCACCCCTGGCTAATTTTTTTGTATTTTTGGTAGAGACAGGGTTTTGCCATGTTGCCCA
GGCTGGTCTCAAACTCCTGGGCTCAACTTCACCCCCGGGATTATAGGCATGAGCCACCGCACCCAGCCTT
GGCTAATTTTTAATAATTTTTTTGTAGACATGAGGTCCTACTGTATTGCCCAGGCTGGTCTTCAGCTCCC
AGGCTCAAGCGATTCTCCCACCTTGGCCTCCCAGTGTTGTGATTACAGGGGTGGGGCACTGGCCCAGCCC
ATCATTTCTCTCTCTCTCTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCGGGCTGGAGTGCAGTGGC
GCGATCTTGGCTCACTGCAACCTCCGCCTCCGGGGTTCAAGCGATTCTCCTGCCCCAGCCCCTCAAGTAG
CTGGGACTACAGGCGTGCGCCCCTACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCGCCA
TGTTGGTTGGCCAGGATGGTCTCGATCTCTTGACCTCGTGATCTGCCCACCTCAGCCTCCCAAAGTGCTG
GGATTACAGGCGTGAGCCACCGCACCTAGCTTTTCTCTCTCTCTCTTTTTTTTTTTTTTTAGACAAAGTC
TCACTCTGTCACCCAGTCTGGAGTGCAGTGGTGCAATCTTGGCTCACTGCAACCTCTGCCTCCCACGTTC
AAGCGATGCTCACACCTCAACTTCCCAAATAGCTGGCATTACAGGCATGCTCCACCAGGCCTGGCTACTT
TTTGTTTTTTTTTTTTTAGTACAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGACAA
GTGATCCACCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGACATGAGCCACCATGCCCAGCCTCCAGC
CCATCATTTCTTGATGATTTGTTGAAACACAGTATGCTGGGGCAGTCACAGAGAGGAGGGGGAGGGACAT
ATGGGAAAAAGAGTTAGAGGGAAAAAGTCTTCCCTCAGTATATTTAATATGTGCAGTTCTCAAATCCTTA
CCCATCCCTTACAGATGGAGTCTTTTGGCACAGGTATGTGGGCAGAGAAGACTTCTGAGGCTACAGTAGG
GGCATCCATAGGGACTGACAGGTGCCAGTCTTGCTCACAGGAGAGAATATTGTGTCCTCCCTCTCTGACA
GGGCACCCAATACTTAC
>ref|hg19|chr17:41201212-41203079
CTAAAATGGACATTTAGATGTAAAATCACTGCAGTAATCTGCATACTTAACCCAGGCCCTCTACCCTACA
CTCTCCGGATGAAGGCTTATAGCAAGACCTCTCAATGGGAGAGTCTGTCTCTCTGCTCCAAAGGACAATG
GTCTTAAAATAGTAGGGGTATGGATTTTAAGTCAATTTGCCACTGATATGCCATGTACTCTGGTTATCAG
TCTCCATAAGGCCACTTGGTATAAGGTTTGATAGTCTCTCAAATAAAATGCTTGAAAGAAAAAAAAATCA
AAGATCTAATTTCCATTAATTTGCTAAATTGCTGGCTAAGACACTGTGTGAAAAAACACCCACCTTCCTT
...
Here you can see that from the 5’ to 3’, each intron starts with CT
and ends with AC
, which corresponds to the GU
–>AG
signature of introns of RNA.
The vtools_report sequence
command can also be used to output mRNA sequences of genes that overlap with the specified regions. Because mRNA consists pieces of regions (coding sequences in exons) and only complete mRNA sequence could be translated, this option outputs complete RNA sequence of the genes (isoforms, as in field refGene.name
). The coding regions will be listed after the sequence name.
For example, you can print output the RNA sequence of gene ABRA
using command
% vtools_report sequence refGene.name2:ABRA --transcribe --numbered
>mRNA|NM_139166|chr8:107782418-107781751,107773742-107773265 (1146 bp on reverse strand)
1 ATGGCTCCGG GCGAAAAGGA AAGCGGGGAG GGCCCAGCCA AGAGCGCCCT CCGGAAGATA
61 CGCACAGCCA CCCTGGTCAT CAGCTTGGCC CGAGGTTGGC AGCAGTGGGC GAATGAGAAC
121 AGCATCAGGC AGGCCCAGGA GCCTACAGGC TGGCTGCCGG GAGGGACCCA GGACTCACCT
181 CAAGCTCCTA AACCAATCAC ACCCCCTACT TCACACCAGA AAGCTCAGAG TGCCCCAAAG
241 TCGCCACCCC GCCTGCCAGA AGGACATGGA GATGGACAAA GCTCAGAGAA AGCCCCTGAG
301 GTTTCTCACA TCAAAAAGAA AGAGGTGTCC AAAACGGTGG TCAGCAAGAC TTATGAGAGA
361 GGAGGGGACG TGAGCCACCT CAGCCACAGG TACGAGAGGG ATGCTGGTGT GCTTGAACCT
421 GGGCAGCCAG AGAATGACAT TGACAGAATC CTCCACAGCC ACGGCTCCCC AACGCGGAGG
481 AGAAAATGTG CCAACCTGGT GTCTGAGCTA ACCAAGGGCT GGAGAGTGAT GGAGCAGGAG
541 GAGCCCACAT GGAGGAGTGA CAGCGTAGAC ACAGAGGACA GCGGCTATGG AGGAGAGGCT
601 GAGGAGAGGC CCGAGCAGGA TGGAGTGCAG GTGGCTGTGG TCAGGATCAA GCGCCCCTTG
661 CCCTCCCAGG TAAACAGATT TACAGAGAAA CTCAACTGCA AAGCCCAACA GAAATATAGC
721 CCAGTGGGCA ACTTGAAAGG GAGATGGCAG CAGTGGGCTG ATGAACACAT ACAATCCCAG
781 AAGCTCAATC CTTTCAGTGA AGAGTTTGAT TACGAGCTGG CCATGTCCAC CCGCCTACAC
841 AAAGGAGATG AGGGCTATGG CCGCCCCAAA GAAGGAACCA AAACTGCTGA AAGGGCCAAG
901 CGTGCTGAGG AGCACATCTA CAGGGAAATG ATGGACATGT GCTTCATTAT CTGCACAATG
961 GCTCGCCACA GACGAGATGG CAAGATCCAG GTTACTTTTG GAGATCTCTT TGACAGATAC
1021 GTTCGTATTT CAGATAAAGT AGTGGGCATT CTCATGCGTG CCAGGAAACA TGGACTGGTA
1081 GACTTTGAAG GAGAGATGCT ATGGCAAGGC CGAGATGACC ATGTTGTGAT TACGCTACTC
1141 AAGTGA
The vtools_report sequence
command can also be used to output protein sequences of genes that overlap with the specified regions. Because mRNA consists pieces of regions (coding sequences in exons) and only complete mRNA sequence could be translated, this option outputs complete protein sequence of the genes (isoforms, as in field refGene.name
). The coding regions will be listed after the sequence name.
For example, you can print output the protein sequence of gene ABRA
using command
% vtools_report sequence refGene.name2:ABRA --translate --numbered
>protein|NM_139166|chr8:107773265-107773742,107781751-107782418 (1146 bp)
1 MAPGEKESGE GPAKSALRKI RTATLVISLA RGWQQWANEN SIRQAQEPTG WLPGGTQDSP
61 QAPKPITPPT SHQKAQSAPK SPPRLPEGHG DGQSSEKAPE VSHIKKKEVS KTVVSKTYER
121 GGDVSHLSHR YERDAGVLEP GQPENDIDRI LHSHGSPTRR RKCANLVSEL TKGWRVMEQE
181 EPTWRSDSVD TEDSGYGGEA EERPEQDGVQ VAVVRIKRPL PSQVNRFTEK LNCKAQQKYS
241 PVGNLKGRWQ QWADEHIQSQ KLNPFSEEFD YELAMSTRLH KGDEGYGRPK EGTKTAERAK
301 RAEEHIYREM MDMCFIICTM ARHRRDGKIQ VTFGDLFDRY VRISDKVVGI LMRARKHGLV
361 DFEGEMLWQG RDDHVVITLL K*
A gene can have multiple isoforms. The --translate
option will print out protein sequences of all isoforms, unless you list the isoforms of interest after parameter --translate
. For example,
% vtools_report sequence refGene.name2:BRCA1 --number --translate NM_007300 NM_007299
>protein|NM_007299|chr17:41276113-41276034,41267796-41267743,41258550-41258473,41256973-41256885,41256278-41256139,41251897-41251792,41249306-41249261,41247939-41247863,41246877-41246761,41243049-41242961,41234592-41234421,41228628-41228505,41226538-41226348,41223255-41222945,41219712-41219625,41215968-41215891,41215390-41215350,41209152-41209069,41203134-41203080,41199720-41199660,41197819-41197801 (2100 bp on reverse strand)
1 MDLSALRVEE VQNVINAMQK ILECPICLEL IKEPVSTKCD HIFCKFCMLK LLNQKKGPSQ
61 CPLCKNDITK RSLQESTRFS QLVEELLKII CAFQLDTGLE YANSYNFAKK ENNSPEHLKD
121 EVSIIQSMGY RNRAKRLLQS EPENPSLQET SLSVQLSNLG TVRTLRTKQR IQPQKTSVYI
181 ELGSDSSEDT VNKATYCSVG DQELLQITPQ GTRDEISLDS AKKAACEFSE TDVTNTEHHQ
241 PSNNDLNTTE KRAAERHPEK YQGEAASGCE SETSVSEDCS GLSSQSDILT TQQRDTMQHN
301 LIKLQQEMAE LEAVLEQHGS QPSNSYPSII SDSSALEDLR NPEQSTSEKV LTSQKSSEYP
361 ISQNPEGLSA DKFEVSADSS TSKNKEPGVE RSSPSKCPSL DDRWYMHSCS GSLQNRNYPS
421 QEELIKVVDV EEQQLEESGP HDLTETSYLP RQDLEGTPYL ESGISLFSDD PESDPSEDRA
481 PESARVGNIP SSTSALKVPQ LKVAESAQSP AAAHTTDTAG YNAMEESVSR EKPELTASTE
541 RVNKRMSMVV SGLTPEEFML VYKFARKHHI TLTNLITEET THVVMKTDAE FVCERTLKYF
601 LGIAGGKWVV SYFWVTQSIK ERKMLNEHDF EVRGDVVNGR NHQGPKRARE SQDRKIFRGL
661 EICCYGPFTN MPTGCPPNCG CAARCLDRGQ WLPCNWADV*
>protein|NM_007300|chr17:41276113-41276034,41267796-41267743,41258550-41258473,41256973-41256885,41256278-41256139,41251897-41251792,41249306-41249261,41247939-41247863,41246877-41243452,41243049-41242961,41234592-41234421,41231416-41231351,41228628-41228505,41226538-41226348,41223255-41222945,41219712-41219625,41215968-41215891,41215390-41215350,41209152-41209069,41203134-41203080,41201211-41201138,41199720-41199660,41197819-41197695 (5655 bp on reverse strand)
1 MDLSALRVEE VQNVINAMQK ILECPICLEL IKEPVSTKCD HIFCKFCMLK LLNQKKGPSQ
61 CPLCKNDITK RSLQESTRFS QLVEELLKII CAFQLDTGLE YANSYNFAKK ENNSPEHLKD
121 EVSIIQSMGY RNRAKRLLQS EPENPSLQET SLSVQLSNLG TVRTLRTKQR IQPQKTSVYI
181 ELGSDSSEDT VNKATYCSVG DQELLQITPQ GTRDEISLDS AKKAACEFSE TDVTNTEHHQ
241 PSNNDLNTTE KRAAERHPEK YQGSSVSNLH VEPCGTNTHA SSLQHENSSL LLTKDRMNVE
301 KAEFCNKSKQ PGLARSQHNR WAGSKETCND RRTPSTEKKV DLNADPLCER KEWNKQKLPC
361 SENPRDTEDV PWITLNSSIQ KVNEWFSRSD ELLGSDDSHD GESESNAKVA DVLDVLNEVD
421 EYSGSSEKID LLASDPHEAL ICKSERVHSK SVESNIEDKI FGKTYRKKAS LPNLSHVTEN
481 LIIGAFVTEP QIIQERPLTN KLKRKRRPTS GLHPEDFIKK ADLAVQKTPE MINQGTNQTE
541 QNGQVMNITN SGHENKTKGD SIQNEKNPNP IESLEKESAF KTKAEPISSS ISNMELELNI
601 HNSKAPKKNR LRRKSSTRHI HALELVVSRN LSPPNCTELQ IDSCSSSEEI KKKKYNQMPV
661 RHSRNLQLME GKEPATGAKK SNKPNEQTSK RHDSDTFPEL KLTNAPGSFT KCSNTSELKE
721 FVNPSLPREE KEEKLETVKV SNNAEDPKDL MLSGERVLQT ERSVESSSIS LVPGTDYGTQ
781 ESISLLEVST LGKAKTEPNK CVSQCAAFEN PKGLIHGCSK DNRNDTEGFK YPLGHEVNHS
841 RETSIEMEES ELDAQYLQNT FKVSKRQSFA PFSNPGNAEE ECATFSAHSG SLKKQSPKVT
901 FECEQKEENQ GKNESNIKPV QTVNITAGFP VVGQKDKPVD NAKCSIKGGS RFCLSSQFRG
961 NETGLITPNK HGLLQNPYRI PPLFPIKSFV KTKCKKNLLE ENFEEHSMSP EREMGNENIP
1021 STVSTISRNN IRENVFKEAS SSNINEVGSS TNEVGSSINE IGSSDENIQA ELGRNRGPKL
1081 NAMLRLGVLQ PEVYKQSLPG SNCKHPEIKK QEYEEVVQTV NTDFSPYLIS DNLEQPMGSS
1141 HASQVCSETP DDLLDDGEIK EDTSFAENDI KESSAVFSKS VQKGELSRSP SPFTHTHLAQ
1201 GYRRGAKKLE SSEENLSSED EELPCFQHLL FGKVNNIPSQ STRHSTVATE CLSKNTEENL
1261 LSLKNSLNDC SNQVILAKAS QEHHLSEETK CSASLFSSQC SELEDLTANT NTQDPFLIGS
1321 SKQMRHQSES QGVGLSDKEL VSDDEERGTG LEENNQEEQS MDSNLGEAAS GCESETSVSE
1381 DCSGLSSQSD ILTTQQRDTM QHNLIKLQQE MAELEAVLEQ HGSQPSNSYP SIISDSSALE
1441 DLRNPEQSTS EKDSHIHGQR NNSMFSKRPR EHISVLTSQK SSEYPISQNP EGLSADKFEV
1501 SADSSTSKNK EPGVERSSPS KCPSLDDRWY MHSCSGSLQN RNYPSQEELI KVVDVEEQQL
1561 EESGPHDLTE TSYLPRQDLE GTPYLESGIS LFSDDPESDP SEDRAPESAR VGNIPSSTSA
1621 LKVPQLKVAE SAQSPAAAHT TDTAGYNAME ESVSREKPEL TASTERVNKR MSMVVSGLTP
1681 EEFMLVYKFA RKHHITLTNL ITEETTHVVM KTDAEFVCER TLKYFLGIAG GKWVVSYFWV
1741 TQSIKERKML NEHDFEVRGD VVNGRNHQGP KRARESQDRK IFRGLEICCY GPFTNMPTDQ
1801 LEWMVQLCGA SVVKELSSFT LGTGVHPIVV VQPDAWTEDN GFHAIGQMCE APVVTREWVL
1861 DSVALYQCQE LDTYLIPQIP HSHY*
The output of command vtools_report sequence
can be long and it could be difficult for you to locate a particular location or region that is of interest. If this is the case, you can use option --mark
to mark them in the output.
The mark
parameter accept three types of inputs
chr12 12345
or 12 12345
.chr 12 12345 C T
. In this case the alternative nucleotide or the translated amino acid will be marked and printed. The function does not yet support the translation of DNA sequences affected by indels so only SNVs are allowed in this option.chr12:12345-12355
or refGene_exon.name2:WIF1
.For example, if you are interested in the location of a variant in the protein sequence,
vtools_report sequence refGene.name2:WIF1 --build hg19 --translate --numbered --mark 12 65449852
>protein|NM_007191|chr12:65514971-65514824,65514336-65514197,65471634-65471526,65462684-65462544,65461570-65461475,65460516-65460421,65456356-65456261,65449906-65449811,65448993-65448898,65445250-65445129 (1140 bp on reverse strand)
1 MARRSAFPAA ALWLWSILLC LLALRAEAGP PQEESLYLWI DAHQARVLIG FEEDILIVSE
61 GKMAPFTHDF RKAQQRMPAI PVNIHSMNFT WQAAGQAEYF YEFLSLRSLD KGIMADPTVN
121 VPLLGTVPHK ASVVQVGFPC LGKQDGVAAF EVDVIVMNSE GNTILQTPQN AIFFKTCQQA
181 ECPGGCRNGG FCNERRICEC PDGFHGPHCE KALCTPRCMN GGLCVTPGFC ICPPGFYGVN
241 CDKANCSTTC FNGGTCFYPG KCICPPGLEG EQCEISKCPQ PCRNGGKCIG KSK
CKCSKGY
301 QGDLCSKPVC EPGCGAHGTC HEPNKCQCQE GWHGRHCNKR YEASLIHALR PAGAQLRQHT
If you are interested in the change of amino acid, you can specify reference and alternative alleles,
vtools_report sequence refGene.name2:WIF1 --build hg19 --translate --numbered --mark 12 65449852 C A
>protein|NM_007191|chr12:65514971-65514824,65514336-65514197,65471634-65471526,65462684-65462544,65461570-65461475,65460516-65460421,65456356-65456261,65449906-65449811,65448993-65448898,65445250-65445129 (1140 bp on reverse strand)
1 MARRSAFPAA ALWLWSILLC LLALRAEAGP PQEESLYLWI DAHQARVLIG FEEDILIVSE
61 GKMAPFTHDF RKAQQRMPAI PVNIHSMNFT WQAAGQAEYF YEFLSLRSLD KGIMADPTVN
121 VPLLGTVPHK ASVVQVGFPC LGKQDGVAAF EVDVIVMNSE GNTILQTPQN AIFFKTCQQA
181 ECPGGCRNGG FCNERRICEC PDGFHGPHCE KALCTPRCMN GGLCVTPGFC ICPPGFYGVN
241 CDKANCSTTC FNGGTCFYPG KCICPPGLEG EQCEISKCPQ PCRNGGKCIG KSK
FKCSKGY
301 QGDLCSKPVC EPGCGAHGTC HEPNKCQCQE GWHGRHCNKR YEASLIHALR PAGAQLRQHT
If you would like to mark the exon regions of this gene, you can do
vtools_report sequence refGene.name2:WIF1 --build hg19 --numbered --mark refGene_exon.name2:WIF1
`>ref|hg19|chr12:65444404-65515346 (refGene.name2 1)`
`65444404 <span style='color: red;'>GCTCAGAAAA CTAAAGCAGC ACCTTTATTT TATACATACA AACAGTATAA AATGTTTATT</span>`
`65444464 <span style='color: red;'>AGGTAAGAGC TGTGTTTTGT TTACAATATA TTATATTGCT TCAAGCCAAT GCAAAAAGTT</span>`
`65444524 <span style='color: red;'>CATACATTAT ATTCCCTATT TCATTGTGTT TAGAATATAT TATATTGTTT AAATGCCACT</span>`
`65444584 <span style='color: red;'>ACCACAGTGT AATTTTTTTT TTTTTAATAC TGAATCTCTG GAATAATGGT AAGGTCAAAA</span>`
`65444644 <span style='color: red;'>TATATTGTAT TGAGAGTTTA AAAATTAAGA GCAATTTTTA AAAATGTAAC AAACATCTAA</span>`
`65444704 <span style='color: red;'>ATATCTGACA ATAAAATCTG AAATGCTGTA ACTTCAACAT TAACTGCACC ATCCAAATTC</span>`
`65444764 <span style='color: red;'>TTGTGACTTA CGCATTTTTG CCCAATTTAA CCTTTCTGAT GTTCCCCTGC CCCCAGACAC</span>`
`65444824 <span style='color: red;'>CATAAATGCA TTGTAATTTT GAAAATATCT GCCAACTACA CACTGAAAAT TTTAACCTGA</span>`
`65444884 <span style='color: red;'>TCAATTGACA TAATATAAAA TCTGTCCCAA AGCACTGAAA CAAGAAAATC TATACCATCA</span>`
`65444944 <span style='color: red;'>TGCTACAGAC GTACTTAGAA AACTTAAAAG GAAGAGTAAA TATCAGCTCA GTGATTTATA</span>`
`65445004 <span style='color: red;'>ATGAAGCTAA TAAAATTCAG GCCAGTATTC TTAAGTGTAA TGAACATTAT TTGAACATTC</span>`
`65445064 <span style='color: red;'>AACACATGAA AGGTTAACAA AGGCTATGAA CTTGGTGTAA CTTAAAACGT TTCAGATGTC</span>`
`65445124 <span style='color: red;'>GGAGTTCACC AGATGTAATT GGATTCAGGT GGATCCCGCC GCTCCTCGGC CTTTTTAAGT</span>`
`65445184 <span style='color: red;'>GAAGGCGTGT GCTGCCTGAG CTGGGCGCCT GCTGGCCTCA GGGCATGTAT GAGGCTGGCT</span>`
`65445244 <span style='color: red;'>TCG</span>TACCCTG CAAAATTATT CACAGCTTAA AAGGAAAGAA ACTACAAAAC CAGGCTCTTC`
`65445304 AGATTCATTA TTAACCATTC AAATGAAATC CATTTTTGCC TTTTCAATCC AGAATGAAAA`
`....`
`65514844 <span style='color: red;'>TGAGCATCGA TCCATAGGTA CAGGCTCTCC TCCTGCGGCG GCCCGGCCTC CGCCCGCAGT</span>`
`65514904 <span style='color: red;'>GCCAGCAGGC ACAGGAGGAT GCTCCAGAGC CAGAGCGCGG CGGCAGGGAA GGCGCTCCTC</span>`
`65514964 <span style='color: red;'>CGGGCCATGC TGCTCAGGAC CTCCTCGCTG CCGGGAAAAC TCCTCGTGCC GCACCTACGC</span>`
`65515024 <span style='color: red;'>AACCTGGCGC CGTCAGATAC TCTGCTGCGC TGCAGCTCCC TCAGCCAGGG CTGTTCCCGT</span>`
`65515084 <span style='color: red;'>TTAGACGGCT GGGCGCGTCG CCTCCCGGCC TGGGTGCCAA GGAGCCTGCG AGAGCAGGAG</span>`
`65515144 <span style='color: red;'>GAGGGGGCAG GCAGGACGCG GGAGGAGGAG GGGGGGACCA GCGGGCGCGA GTCGCGCAAG</span>`
`65515204 <span style='color: red;'>AAGATGGGGC GCGGGCGGGA GTGGGGGCGG CAGAGACGTA AGACTGGCAA AGCTGGCGAG</span>`
`65515264 <span style='color: red;'>GCCAGCAGTC AGCGGGGCAA ATAGAGCGAG AACAGAAGAG CGGGAAGGGC TGGCGCGAGC</span>`
`65515324 <span style='color: red;'>GAGGTGCGAG CGAGGAGTGG GGC</span>`
Finally, if you are looking for a particular sub-sequence of the output, you can try to mark the sequence
% vtools_report sequence refGene.name2:ABRA --transcribe --numbered --build hg19 --mark CTGCAC
>mRNA|NM_139166|chr8:107782418-107781751,107773742-107773265 (1146 bp on reverse strand)
1 ATGGCTCCGG GCGAAAAGGA AAGCGGGGAG GGCCCAGCCA AGAGCGCCCT CCGGAAGATA
61 CGCACAGCCA CCCTGGTCAT CAGCTTGGCC CGAGGTTGGC AGCAGTGGGC GAATGAGAAC
121 AGCATCAGGC AGGCCCAGGA GCCTACAGGC TGGCTGCCGG GAGGGACCCA GGACTCACCT
181 CAAGCTCCTA AACCAATCAC ACCCCCTACT TCACACCAGA AAGCTCAGAG TGCCCCAAAG
241 TCGCCACCCC GCCTGCCAGA AGGACATGGA GATGGACAAA GCTCAGAGAA AGCCCCTGAG
301 GTTTCTCACA TCAAAAAGAA AGAGGTGTCC AAAACGGTGG TCAGCAAGAC TTATGAGAGA
361 GGAGGGGACG TGAGCCACCT CAGCCACAGG TACGAGAGGG ATGCTGGTGT GCTTGAACCT
421 GGGCAGCCAG AGAATGACAT TGACAGAATC CTCCACAGCC ACGGCTCCCC AACGCGGAGG
481 AGAAAATGTG CCAACCTGGT GTCTGAGCTA ACCAAGGGCT GGAGAGTGAT GGAGCAGGAG
541 GAGCCCACAT GGAGGAGTGA CAGCGTAGAC ACAGAGGACA GCGGCTATGG AGGAGAGGCT
601 GAGGAGAGGC CCGAGCAGGA TGGAGTGCAG GTGGCTGTGG TCAGGATCAA GCGCCCCTTG
661 CCCTCCCAGG TAAACAGATT TACAGAGAAA CTCAACTGCA AAGCCCAACA GAAATATAGC
721 CCAGTGGGCA ACTTGAAAGG GAGATGGCAG CAGTGGGCTG ATGAACACAT ACAATCCCAG
781 AAGCTCAATC CTTTCAGTGA AGAGTTTGAT TACGAGCTGG CCATGTCCAC CCGCCTACAC
841 AAAGGAGATG AGGGCTATGG CCGCCCCAAA GAAGGAACCA AAACTGCTGA AAGGGCCAAG
901 CGTGCTGAGG AGCACATCTA CAGGGAAATG ATGGACATGT GCTTCATTAT <span style='color: red;'>CTGCAC</span>AATG
961 GCTCGCCACA GACGAGATGG CAAGATCCAG GTTACTTTTG GAGATCTCTT TGACAGATAC
1021 GTTCGTATTT CAGATAAAGT AGTGGGCATT CTCATGCGTG CCAGGAAACA TGGACTGGTA
1081 GACTTTGAAG GAGAGATGCT ATGGCAAGGC CGAGATGACC ATGTTGTGAT TACGCTACTC
1141 AAGTGA