Many software applications for linkage and association studies (e.g. plink, merlin) accept PED/MAP format. The map file records basic information about markers, the Ped file contains sample genotype. Whereas variant tools can easily export in MAP format, it is difficult to export in PED format because this format is not variant oriented (output by variant).
Fortunately, PLINK accepts a transposed PED format (TPED) that is variant oriented. variant tools can export variants in this format using format tped
.
Format tped
cannot import from TPED files because this format does not specify reference and alternative alleles. To import PLINK TPED data please use [format plink
][3]
To make the exported TPED file compatible with other software applications that handles PLINK input, a TFAM file containing individual names is generated. Unlike a standard TFAM file which is the first six columns of a PED file, i.e., Family ID, Individual ID, Paternal ID, Maternal ID, Sex and Phenotype, the TFAM file exported here only has valid individual ID, with other columns being placeholders. However it is straightforward to create a TFAM file with information for other columns using vtools phenotype --output
command.
vtools show format tped
Format: tped
Description: Output to TPED format with the first four columns chr name gen_pos
pos, and the rest for genotypes. Variant tools cannot import from
this format because it does not contain information about reference
genome.
Columns:
1 chromosome (without leading chr)
2 Locus name
3 Genetic distance, left empty
4 Physical position
5 genotype
Formatters are provided for fields: gt
variant:
chr Chromosome
pos 1-based position
ref
alt
Format parameters:
name (export) Field for name of variants, can be dbSNP.name if dbSNP is
available (default: )
style (export) Style of genotype format, can be 'genotype' for genotype
separated by tab (e.g. A C), or 'numeric' for 0, 1, or
2 for number of alternative alleles. (default:
genotype)
tfamfile (export) Name of the tfam file to be outputed. Filename that does not
ends with .tfam will be ignored. (default: \\(table.tfam)
By default, the corresponding TFAM file is named after the variant table from which variants/samples are exported. You can specify the TFAM file name by passing an additional parameter with specified filename: --tfamfile FILENAME.tfam
. Note that the extension .tfam
is required.
Output some variants. The genotypes for selected samples are ordered alphabetically by sample name (and the selected samples are recorded in the log file).
vtools export variant --format tped --samples 'sample_name like "NA069%"' -v0
1 . . 10533 G G G G G G
1 . . 51479 T T T A T A
1 . . 51928 G G G G G G
1 . . 54586 T T T T T T
1 . . 54676 C C C C C C
1 . . 54708 G G G G G G
1 . . 55299 C C C C C C
1 . . 62203 T T T T T T
1 . . 63671 G G G A G A
1 . . 86028 T T T T T T
1 . . 86065 G G G G G G
1 . . 86331 A A A A A A
1 . . 87190 G G G A G G
1 . . 88316 G G G G G G
1 . . 88338 G G G A G G
1 . . 91536 G G T T G G
1 . . 108310 C C T C T C
1 . . 233473 C C C C C G
1 . . 234760 A A A A A A
Export in numeric style. This will recode the tped file to a numeric format using additive coding, equivalent to applying the PLINK [--recodeA
][4] command to the data:
vtools export variant --format tped --samples 'sample_name like "NA069%"' --style numeric -v0
1 . . 10533 0 0 0
1 . . 51479 0 1 1
1 . . 51928 0 0 0
1 . . 54586 0 0 0
1 . . 54676 0 0 0
1 . . 54708 0 0 0
1 . . 55299 0 0 0
1 . . 62203 0 0 0
1 . . 63671 0 1 1
1 . . 86028 0 0 0
1 . . 86065 0 0 0
1 . . 86331 0 0 0
1 . . 87190 0 1 0
1 . . 88316 0 0 0
1 . . 88338 0 1 0
1 . . 91536 0 2 0
1 . . 108310 2 1 1
1 . . 233473 0 0 1
1 . . 234760 0 0 0
The name field of the output is empty. If you would like to use an annotation database to assign variants a name, you can do
vtools use dbSNP
vtools export variant --format tped --samples 'sample_name like "NA069%"' --name dbSNP.name -v0
1 rs114315702 . 10533 G G G G G G
1 rs116400033 . 51479 T T T A T A
1 rs78732933 . 51928 G G G G G G
1 rs79600414 . 54586 T T T T T T
1 rs2462492 . 54676 C C C C C C
1 rs115797567 . 54708 G G G G G G
1 rs10399749 . 55299 C C C C C C
1 rs28402963 . 62203 T T T T T T
1 rs80011619 . 63671 G G G A G A
1 rs114608975 . 86028 T T T T T T
1 rs116504101 . 86065 G G G G G G
1 rs115209712 . 86331 A A A A A A
1 rs1524602 . 87190 G G G A G G
1 rs113759966 . 88316 G G G G G G
1 rs55700207 . 88338 G G G A G G
1 rs6702460 . 91536 G G T T G G
1 rs74747225 . 108310 C C T C T C
1 rs112455420 . 233473 C C C C C G
1 rs7548182 . 234760 A A A A A A
In case there are multiple entries for a variant in the dbSNP database, only the first name will be used, which does not have to be the most commonly used name for the variant.
A TFAM file is also generated,
0 NA06985 0 0 0 -9
1 NA06986 0 0 0 -9
2 NA06994 0 0 0 -9
3 NA06985 0 0 0 -9
4 NA06986 0 0 0 -9
5 NA06994 0 0 0 -9
6 NA06985 0 0 0 -9
7 NA06986 0 0 0 -9
8 NA06994 0 0 0 -9
[3]: / /documentation/customization/format/supportedformats/tped/plink/ [4]: http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml