TagZilla

User's guide to TagZilla version 1.0

Introduction

This user's guide explains the use of the TagZilla program. TagZilla generates tag SNPs for any given set of SNPs with genotypes. The TagZilla SNP selection algorithm estimates pair-wise r² and d' (d-prime) linkage disequilibrium (LD) statistics on genotype data from unrelated individuals. It then estimates bins using a greedy maximal approach similar to that of Carlson et al. (2004), evaluates all the tags for each bin based upon user-specified criteria, and recommends an optimal tag for each bin if possible.
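The greedy binning idea can be pictured with a small sketch in the spirit of Carlson et al. (2004). This is not TagZilla's implementation: the function name, the dictionary layout for the r² values and the tie-breaking rule are assumptions made for illustration, and d' thresholds, obligate tags and exclusions are ignored.

```python
def greedy_bins(r2, loci, threshold=0.8):
    """Greedy maximal binning sketch (hypothetical helper, not TagZilla code).

    r2 maps frozenset({a, b}) to a pairwise r-squared value; pairs that are
    missing from the mapping are treated as r-squared = 0.
    Returns a list of (tags, members) tuples in the order bins were formed.
    """
    def linked(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = sorted(loci)
    bins = []
    while remaining:
        # Pick the locus that covers the most remaining loci at the threshold.
        # Ties go to the first locus in sorted order (an arbitrary choice).
        best = max(remaining, key=lambda s: sum(linked(s, t) for t in remaining))
        members = [t for t in remaining if linked(best, t)]
        # Any member linked to every other member could serve as a tag.
        tags = [s for s in members if all(linked(s, t) for t in members)]
        bins.append((tags, members))
        remaining = [t for t in remaining if t not in members]
    return bins


# Toy r-squared values for five made-up loci.
toy_r2 = {
    frozenset(("a", "b")): 0.90,
    frozenset(("a", "c")): 0.85,
    frozenset(("b", "c")): 0.50,
    frozenset(("d", "e")): 0.95,
}
toy_bins = greedy_bins(toy_r2, ["a", "b", "c", "d", "e"])
print(toy_bins)
```

In the toy data, locus "a" is the only tag of the first bin because "b" and "c" are not in sufficient LD with each other, while "d" and "e" can each tag the second bin.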

The program provides many options, such as minimum MAF (minor allele frequency), d' threshold and r² threshold, include/exclude/subset, optimization of numbers of bins for fixed sized panels, or thresholds to reach a desired level of coverage. It also provides options for different input formats such as HapMap, Linkage, and FESTA. Assay design scores and weighting criteria can be incorporated into the analysis in order to choose optimal tags for each bin.

Notice

TagZilla version 1.1 will be released at the end of next week. It will contain numerous updates, including several corrections and extensions to the multi-population tagging capabilities.

TagZilla options

TagZilla provides you with many options for controlling its execution. It can read in multiple genotype files that contain data from disjoint genomic regions. TagZilla allows each input genotype file to utilize different minimum MAF, r², completion rate and other parameters.

Usage: python tagzilla.py [options] genotype_file [options] genotype_file

Example:

python tagzilla.py -p pedinfo2sample_CEU.txt -b summary -o pairs.out -O loci.out -D designscore.txt:0.6 genotypes_chr21_CEU.txt.gz
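When TagZilla is driven from a script, it can be convenient to assemble the command as an argument list. The sketch below is only an illustration: the helper name build_tagzilla_command and its keyword names are hypothetical, and it uses only options that appear in this guide.

```python
import shlex


def build_tagzilla_command(genotype_file, pedfile=None, summary=None,
                           ld_output=None, locus_output=None,
                           design_scores=None):
    """Assemble an argument list from options documented in this guide."""
    cmd = ["python", "tagzilla.py"]
    if pedfile:
        cmd += ["-p", pedfile]
    if summary:
        cmd += ["-b", summary]
    if ld_output:
        cmd += ["-o", ld_output]
    if locus_output:
        cmd += ["-O", locus_output]
    if design_scores:
        cmd += ["-D", design_scores]
    cmd.append(genotype_file)  # the genotype file follows its options
    return cmd


cmd = build_tagzilla_command(
    "genotypes_chr21_CEU.txt.gz",
    pedfile="pedinfo2sample_CEU.txt",
    summary="summary",
    ld_output="pairs.out",
    locus_output="loci.out",
    design_scores="designscore.txt:0.6",
)
print(shlex.join(cmd))
# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Building a list rather than a single string avoids shell quoting problems and makes it harder to paste a typographic dash in place of a plain hyphen.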

TagZilla supports both short options and long options.

Short options: a token that starts with a dash followed by a letter. For example, -p
Long options: a token that starts with two dashes followed by a word. For example, --pedfile

You can use the --version option to check the program version and --help/-h option to print out help messages. The options are grouped into four categories:

Genotype and LD estimate options
Binning options
Input options
Output options

The table for each category serves as a quick reference to the options. More detailed discussions about the purpose and usage of the options are in the notes below each table.

Genotype and LD estimation options:

Option	Explanation
-a FREQ, --minmaf=FREQ	Minimum minor allele frequency (MAF) (default=0.05)
-A FREQ, --minobmaf=FREQ	Minimum minor allele frequency (MAF) for obligate tags (defaults to -a/--minmaf)
-c N, --mincompletion=N	Drop loci with less than N valid genotypes (default=0)
--mincompletionrate=N%	Drop loci with completion rate less than N% (0-100, default=0)
-m D, --maxdist=D	Maximum inter-marker distance in kb for LD comparison (default=200)
-P p, --hwp=p	Filter out loci that fail to meet a minimum significance level (p value) for the test of Hardy-Weinberg proportion

Notes:

-a/--minmaf and -A/--minobmaf:

Both options specify the MAF threshold to filter out loci with low MAF from the analysis. You can set different thresholds for obligates versus any other loci, but the default for -a/--minmaf is 0.05, and the default for -A/--minobmaf will take the value set for -a/--minmaf option.
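As an illustration of the quantity being thresholded, the following sketch computes a minor allele frequency from two-letter genotype strings. It assumes a biallelic locus and that 'N' marks a missing allele; the function name is hypothetical and this is not TagZilla's own code.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF for two-letter genotypes such as 'AC'.

    'N' is treated as a missing allele and ignored. A locus with fewer
    than two observed alleles is reported with a MAF of 0.0.
    """
    counts = Counter(a for g in genotypes for a in g if a != "N")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


genotypes = "AA AA AA AC AA AA AC AA AA AA AA AC AA AA".split()
maf = minor_allele_frequency(genotypes)
print(round(maf, 3), maf >= 0.05)  # 3 C alleles out of 28
```

With the default threshold of 0.05, this example locus would be retained.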

-c/--mincompletion and --mincompletionrate:

These options are used to drop loci with low number or rate of valid genotypes among all the genotyped samples.

-m/--maxdist:

The linkage between loci usually decreases as the distance between loci increases. We won't consider the linkage disequilibrium between two loci if the distance between them is greater than the number specified in this option.

-P/--hwp:

This option is used to specify the threshold of P value for the Hardy-Weinberg Equilibrium test. If the count of the minor alleles in the set of genotypes is less than 1000, TagZilla applies the exact test based on Wigginton JE et al. (2005), otherwise it simply uses the standard Chi-square test. Loci that fail to meet this threshold are filtered from the analysis.
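For orientation, the standard one-degree-of-freedom chi-square test of Hardy-Weinberg proportions can be sketched as follows. This covers only the chi-square branch, not the exact test of Wigginton et al. (2005); the function name is hypothetical, and whether TagZilla applies any correction to the statistic is not stated here.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P value of the 1-df chi-square test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic locus: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chisq / 2.0))


print(hwe_chisq_pvalue(25, 50, 25))  # genotype counts in exact HW proportions
print(hwe_chisq_pvalue(50, 0, 50))   # no heterozygotes at all
```

A locus would be dropped when the returned p value falls below the threshold given to -P/--hwp.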

Binning options:

Option	Explanation
-C crit, --tagcriteria=crit	Use the specified criteria to choose the optimal tag for each bin. Currently supported tag selection criteria: maxtag (choose the tag having the largest minimum r² with any tag SNPs in the bin); maxsnp (choose the tag having the largest minimum r² with all SNPs in the bin); avgtag (choose the tag having the maximum average r² with non-tag SNPs in the bin); avgsnp (choose the tag having the maximum average r² with all SNPs in the bin).
-d DPRIME, --dthreshold=DPRIME	Minimum d-prime threshold to output (default=0)
-r N, --rthreshold=N	Minimum r-squared threshold to output (default=0.8)
-t N, --targetbins=N	Stop when N bins have been selected (default=0 for unlimited)
-T N, --targetloci=N	Stop when N loci have been tagged (default=0 for unlimited)
-M N, --multipopulation=N	Multipopulation tagging where N is the number of populations
--multimerge	Merge populations when performing multipopulation tagging [not recommended]
-z N, --locipertag=N	Ensure that bins contain more than one tag per N loci. Bins with insufficient tags will be reduced.
-Z B, --loglocipertag=B	Ensure that bins contain more than the ceiling of log_B(loci) tags. Bins with insufficient tags will be reduced.

Notes:

-C/--tagcriteria:

Example: -C maxsnp:2

Give half the weight to each tag that does not meet the maxsnp criterion.

This option can be used together with the -D/--designscores option to specify how the optimal tag should be selected for each bin. -C/--tagcriteria provides the weights, and -D/--designscores provides the designscores. TagZilla will compute a weighted score and thus determine which tag is recommended to the user.

-d/--dthreshold and -r/--rthreshold:

Both are used as cut-off criteria so that only locus pairs satisfying these thresholds are considered in the binning process.

-t/--targetbins and -T/--targetloci:

Both options are used as stopping criteria. In either case, once the criterion is met, TagZilla produces residual bins instead of maximal bins.

-M/--multipopulation and --multimerge:

You can specify the number of populations via the -M/--multipopulation option. Unless --multimerge has been set, TagZilla uses the minLD method to bin the loci with genotypes from different populations and thus generates a set of tags applicable to all the populations. The --multimerge option is not recommended.

-z/--locipertag and -Z/--loglocipertag:

Both options control the ratio between tags and loci. If a bin is too large, and thus the number of loci per tag is too high, a genotyping failure on the tag will lose the information on the many loci for which that tag is the only surrogate. Instead of picking another candidate tag from a large bin in a post-processing step, TagZilla incorporates this user requirement into the binning process and generates only bins that satisfy it.

Input options:

Option	Explanation
-p FILE, --pedfile=FILE	Pedigree file for HapMap, PrettyBase or raw genotype files (optional). This option can be specified multiple times on the command line.
-s FILE, --subset=FILE	File containing loci that define the subset to be analyzed of the loci that are read
-l FILE, --loci=FILE	Locus description file for genotypes input in Linkage format
-i FILE, --includetag=FILE	File containing loci that are obligates
-L N, --limit=N	Limit the number of loci considered to N for testing purposes (default=0 for unlimited)
-f NAME, --format=NAME	Format for genotype/pedigree or LD input data. Values: hapmap (default), linkage, festa, prettybase, raw.
-e FILE, --excludetag=FILE	File containing loci that are excluded from being a tag
-D FILE, --designscores=FILE	Read in design scores or other weights to use as criteria to choose the optimal tag for each bin. Example: -D designscore1.txt:0.5:1, where 0.5 is the threshold and 1 is the scale. Both are optional; the default threshold is 0 and the default scale is 1. This option can be specified multiple times on the command line.
-R S-E,… --range=S-E,…	Ranges of genomic locations to be analyzed. They are specified as a comma separated list of start and end coordinates "S-E". If either S or E is not specified, then the ranges are assumed to be open. The end coordinate is exclusive and not included in the range. Example: -R 10000-20000, 30000-80000

Notes:

-p/--pedfile:

This option specifies the pedigree file for those genotypes provided in the format of Hapmap, Prettybase or raw. It is not meaningful to specify a pedigree file when reading genotype or LD data in linkage or FESTA format. The genotypes for the non-founders as found in the pedigree file won't be considered in the binning process. Note that if the pedigree file is incomplete, we assume all the individuals not contained in the pedigree file are founders.

-s/--subset:

Besides providing a file containing the subset of loci to be analyzed, the user can also specify a comma separated list of loci from the command line. The string value for this option has to start with a colon. For example,

-s :rs12355,rs12365,rs12488

-l/--loci:

The locus description file for genotype input in linkage format. TagZilla reads in the location for each locus from this file.

-i/--includetag and -e/--excludetag:

Similar to the -s/--subset option, both options accept either a file name or a colon-prefixed list of loci as their string value. The specified loci are either forced in as tags (-i/--includetag) or excluded from being chosen as tags (-e/--excludetag).

-D/--designscores:

This option can be used alone or together with -C/--tagcriteria to choose the optimal tag among all the valid tags for a bin.
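The FILE:threshold:scale form of the -D/--designscores value, as described in the option table, can be split as in the sketch below. The function name is hypothetical, and the simple split on ':' assumes the file name itself contains no colon.

```python
def parse_designscore_option(value):
    """Split 'FILE[:threshold[:scale]]' into (filename, threshold, scale).

    Defaults follow the option table: threshold 0 and scale 1.
    """
    parts = value.split(":")
    filename = parts[0]
    threshold = float(parts[1]) if len(parts) > 1 and parts[1] else 0.0
    scale = float(parts[2]) if len(parts) > 2 and parts[2] else 1.0
    return filename, threshold, scale


print(parse_designscore_option("designscore1.txt:0.5:1"))
print(parse_designscore_option("designscore.txt:0.6"))
```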

-L/--limit:

This option is useful for testing purposes. If the genotype data are too big to complete a run quickly, you can limit the number of loci by specifying a value for the option.

-f/--format:

The current version supports five different formats: hapmap (default), linkage, festa, prettybase and raw. Format names are case-insensitive.

-R/--range:

A list of genomic location pairs can be specified via this option to filter out the loci located outside these ranges.
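The range syntax of -R/--range (comma separated "S-E" pairs, open when S or E is omitted, end coordinate exclusive) could be interpreted as in this sketch. The function names are hypothetical, and treating the start coordinate as inclusive is an assumption.

```python
def parse_ranges(spec):
    """Parse 'S-E,S-E,...' into a list of (start, end) tuples.

    A missing start or end is returned as None, meaning the range is open
    on that side.
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        ranges.append((int(start) if start else None,
                       int(end) if end else None))
    return ranges


def in_ranges(pos, ranges):
    """True if pos lies in any range; the end coordinate is exclusive."""
    return any((s is None or pos >= s) and (e is None or pos < e)
               for s, e in ranges)


ranges = parse_ranges("10000-20000,30000-80000")
print(ranges, in_ranges(10000, ranges), in_ranges(20000, ranges))
```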

Output options:

Option	Explanation
-O FILE, --locusinfo=FILE	Output locus information to FILE ('-' for standard out)
-o FILE, --output=FILE	Output tabular LD information for bins to FILE ('-' for standard out)
-x, --extra	Output inter-bin LD statistics to the file specified in -o/--output option.
-H N, --histomax=N	Largest bin size output in summary histogram output (default=10)
-k, --skip	Skip output of untagged or excluded loci
-b FILE, --summary=FILE	Output summary tables for all bins to FILE (Default to standard out)
-B FILE, --bininfo=FILE	Output summary information about each bin to FILE

Notes:

-o/--output, -x/--extra and -k/--skip:

-o/--output specifies the name of the output file containing LD information for the bins, -x/--extra triggers appending the inter-bin LD statistics to the same file, and -k/--skip skips output of the pair of loci if the disposition of the bin is either obligate-exclude or residual, or either one of the pair is in the exclude set.

-b/--summary and -H/--histomax:

-b/--summary specifies the name of the output file containing the histogram table summaries for all the bins, and -H/--histomax is the largest bin size that is included in the table.

-B/--bininfo:

This option specifies the name of the output file containing the summary information including tags, non-tags, bin size, location and spacing about each bin.

-O/--locusinfo:

This option specifies the name of the output file containing the locus information such as location, MAF, bin number and disposition for each locus.

File formats

This section describes the file format for both input and output files. Following are the allowable input files:

Genotype data in Hapmap format
Pedigree file for Hapmap genotype data
Genotype data in linkage format
Locus info file for linkage format genotype data
Genotype data in “raw” format
Pedigree file for “raw” format genotype data
Genotype data in prettybase format
Pre-computed pair-wise LD values in FESTA format
SNP list files
Design score files

Following are the output files:

Bin info file
LD data file
Locus info file
Bin summary statistics file

Some simple examples are included to illustrate the format more clearly, and some example files are included in the TagZilla package, which can be referred to if needed.

Hapmap format:

The header of Hapmap format files looks like

'rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code' followed by a list of sample identifiers. If a pedigree file is specified on the command line with the '-p' option, the program checks against it and keeps only the genotypes from non-related individuals (i.e. founders) for each locus for further analysis. Here is a sample line of the hapmap format genotype file:

rs169757 A/C Chr21 9928594 + ncbi_b35.1 broad urn:LSID:affymetrix.hapmap.org:Protocol:genotype_protocol_1:1 urn:LSID:affymetrix.hapmap.org:Assay:1612756:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ AC AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AC AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AC AA AC AA AA AA AA AA AA AA AA AC AA AA AC AC AA AA AA AA AA AA AA AA

The following table describes the valid values for each column:

Column header	Description
rs#	A string of characters starting with letters 'rs' then followed by digits, e.g. rs12345
SNPalleles	All possible alleles (A, G, C or T) for the SNP with each separated by a forward slash, e.g. A/G
chrom	Three-letter string 'Chr' followed by a number from 1 to 22 or a letter X or Y for sex chromosome, e.g. Chr22
pos	Position of the SNP, an integer number
strand	One single character, either '+' or '-'. '+' refers to a strand going from 5-prime telomere to 3-prime telomere, and '-' refers to a strand going from 3-prime telomere to 5-prime telomere.
genome_build	A string of characters, e.g. ncbi_b35.1
center	A string of characters, e.g. broad
protLSID	A string of characters, e.g. urn:LSID:affymetrix.hapmap.org:Protocol:genotype_protocol_1:1
assayLSID	A string of characters, e.g. urn:lsid:affymetrix.hapmap.org:Assay:1612756:1
panelLSID	A string of characters, e.g. urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1
QC_code	Either 'QC+' or 'QC-'
genotypes	A pair of letters, with each letter chosen from the set of (A,G,C,T,N), e.g. AG

Note that all columns are case-sensitive unless mentioned otherwise, and no spaces are allowed within a column.
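A line in this layout can be split into its eleven fixed columns and the per-sample genotypes as sketched below. The function name and the sample identifiers S1 to S3 are hypothetical; real files take the sample identifiers from the header line.

```python
HAPMAP_FIXED_COLUMNS = 11  # rs# through QC_code


def parse_hapmap_line(line, sample_ids):
    """Split one whitespace-separated HapMap genotype line into a record."""
    fields = line.split()
    meta = fields[:HAPMAP_FIXED_COLUMNS]
    genotypes = fields[HAPMAP_FIXED_COLUMNS:]
    if len(genotypes) != len(sample_ids):
        raise ValueError("genotype count does not match the header")
    return {
        "name": meta[0],
        "alleles": meta[1].split("/"),
        "chrom": meta[2],
        "pos": int(meta[3]),
        "strand": meta[4],
        "qc_code": meta[10],
        "genotypes": dict(zip(sample_ids, genotypes)),
    }


line = ("rs169757 A/C Chr21 9928594 + ncbi_b35.1 broad "
        "urn:LSID:affymetrix.hapmap.org:Protocol:genotype_protocol_1:1 "
        "urn:LSID:affymetrix.hapmap.org:Assay:1612756:1 "
        "urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ AC AA AA")
record = parse_hapmap_line(line, ["S1", "S2", "S3"])  # hypothetical sample ids
print(record["name"], record["pos"], record["genotypes"])
```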

Pedigree file for Hapmap format genotype data:

This is specified by the '-p' option from the command line. The valid values for each column of the file are:

Column number	Description
1	pedigree id, an integer
2	individual id, an integer
3	father id, an integer
4	mother id, an integer
5	Sex, a single digit, 1 for male, and 2 for female
6	hapmap individual id, a string of characters. Example: urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1
7	hapmap sample id, a string of characters. Example: urn:lsid:dcc.hapmap.org:Sample:NA12003:1

Rows having the same pedigree id constitute individuals belonging to the same family. If an individual is a founder, they will have both father and mother id set to 0, otherwise, the values will relate to the individual identifiers from other lines within the file. TagZilla will utilize only founders in the input data.

1420   9      0      0      1 urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1 urn:lsid:dcc.hapmap.org:Sample:NA12003:1
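The founder rule (father id and mother id both 0) can be applied to pedigree lines as in the sketch below. The function name is hypothetical, the second input line is made-up data, and how the returned identifiers are matched to genotype columns is left open here.

```python
def founder_sample_ids(lines):
    """Return the sample id (column 7) of every founder in a pedigree file.

    A founder has both father id (column 3) and mother id (column 4) set to 0.
    """
    founders = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or incomplete lines
        father, mother = fields[2], fields[3]
        if father == "0" and mother == "0":
            founders.append(fields[6])
    return founders


pedigree_lines = [
    "1420   9      0      0      1 "
    "urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1 "
    "urn:lsid:dcc.hapmap.org:Sample:NA12003:1",
    # Made-up non-founder row, for illustration only.
    "1420   1      9      10     2 urn:example:Individual:child:1 "
    "urn:example:Sample:child:1",
]
founders = founder_sample_ids(pedigree_lines)
print(founders)
```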

Linkage format:

Linkage format is the required input format for Haploview. This file should not have any header lines. The valid values for each column are:

Column number	Description
1	Pedigree name: a unique alphanumeric identifier for this individual's family. Unrelated individuals shouldn't share a common pedigree name.
2	Individual Id: a unique alphanumeric identifier for this individual within his family.
3	Father ID: father's individual ID or '0' for unknown father. Note that if a father ID is specified, the father must also appear in the file.
4	Mother ID: mother's individual ID or '0' for unknown mother. Note that if a mother ID is specified, the mother must also appear in the file.
5	Gender: individual's gender (1 for male, 2 for female)
6	Affectation status: used for association tests (0 for unknown, 1 for unaffected and 2 for affected).
>6	Marker genotypes: each marker is represented by two columns (one for each allele, separated by a space) and coded 1-4 where: 1=A, 2=C, 3=G, 4=T. A 0 in any of the marker genotype positions indicates missing data.

Files should also follow these two guidelines:

· Families should be listed consecutively within the file (i.e. all the lines with the same pedigree ID should be adjacent).

· If an individual has a nonzero parent, the parent should be included in the file on his own line.

Below is a sample line of the linkage format genotype file.

3 12 8 9 1 2 1 2 3 3 0 0 4 2
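A whitespace-separated line in the layout described by the column table (six fixed columns followed by two allele codes per marker) can be decoded as sketched here. The function name and the returned dictionary keys are hypothetical.

```python
# Allele coding from the column table; 0 means missing data.
ALLELE_CODES = {"1": "A", "2": "C", "3": "G", "4": "T", "0": None}


def parse_linkage_line(line):
    """Decode one linkage-format line into ids and per-marker allele pairs."""
    fields = line.split()
    pedigree, individual, father, mother, sex, status = fields[:6]
    codes = fields[6:]
    if len(codes) % 2 != 0:
        raise ValueError("each marker needs two allele columns")
    genotypes = [(ALLELE_CODES[a], ALLELE_CODES[b])
                 for a, b in zip(codes[0::2], codes[1::2])]
    return {
        "pedigree": pedigree,
        "individual": individual,
        "is_founder": father == "0" and mother == "0",
        "sex": sex,
        "status": status,
        "genotypes": genotypes,
    }


parsed = parse_linkage_line("3 12 8 9 1 2 1 2 3 3 0 0 4 2")
print(parsed["individual"], parsed["is_founder"], parsed["genotypes"])
```

The third marker in this example decodes to a missing genotype because both allele codes are 0.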

Locus info file:

This file is required when processing linkage format genotype data, and it is specified by using the '-l' option from the command line. Each line of the file has two columns. The first column is the locus name and the second column is the locus location. For example:

rs169757 9928594

Raw format genotype data file:

TagZilla reads in the genotype data in this format if the '-f raw' option is specified from the command line. It has a header line which has this format 'rs#<tab>chr<tab>pos<tab>' followed by a tab delimited list of sample ids. The following table describes the valid values for each column:

Column	Description
rs#	A string of characters starting with the letters 'rs' followed by digits, e.g. rs12345
chr	Three-letter String 'Chr' followed by a number from 1 to 22 or a letter X or Y for sex chromosome, e.g. Chr22
pos	position of the SNP, an integer number, e.g. 54321
genotypes	A pair of letters with each letter chosen from the set of (A,G,C,T,N), e.g. AG

TagZilla checks against a pedigree file and keeps only the genotypes from founders for each locus for further analysis.

Following is a sample line of the raw format genotype file.

rs169757 Chr21 9928594 AA AA AA AC AA AA AC AA AA AA AA AC AA AA

Pedigree file for raw format genotype data:

This file is optionally used when processing raw format genotype data.

Note that the individual id must be unique in the file and can be mapped to one of the genotype columns in the header line of the raw format genotype data file. Refer to “Pedigree file for Hapmap format genotype data” for more details.

Prettybase format genotype data file:

The following table describes the valid values for each column of the prettybase format genotype file:

Column	Description
Site position	An integer uniquely identifying the locus
Individual id	A string of characters uniquely identifying the individual, case-sensitive
First allele	One character chosen from the set (A,G,C,T,?) , with '?' for unknown
Second allele	One character chosen from the set (A,G,C,T,?) , with '?' for unknown

Following are some sample lines of a prettybase genotype file:

10110 PT01B    C        C
10110 PT02B    G        G
10110 PT03B    G        G
10110 PT04B    G        G
10110 PT05B    G        G
10110 PT06B    G        G
10287 PT01B    ?        G
10287 PT02B    C        C
10287 PT03B    C        C
10287 PT04B    C        C
10287 PT05B    C        C
10287 PT06B    C        C
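Because prettybase files list one genotype per line, a reader typically regroups them by site. The sketch below does this for the four-column layout in the table; the function name is hypothetical, and blank or malformed lines are simply skipped.

```python
from collections import defaultdict


def read_prettybase(lines):
    """Group prettybase rows as {site: {individual: (allele1, allele2)}}.

    The '?' code for an unknown allele is converted to None.
    """
    by_site = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        site, individual, a1, a2 = fields
        by_site[int(site)][individual] = (
            None if a1 == "?" else a1,
            None if a2 == "?" else a2,
        )
    return dict(by_site)


rows = [
    "10110 PT01B    C        C",
    "10110 PT02B    G        G",
    "",
    "10287 PT01B    ?        G",
    "10287 PT02B    C        C",
]
sites = read_prettybase(rows)
print(sites)
```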

FESTA formatted linkage disequilibrium files:

TagZilla can read in these files containing the pre-computed pair-wise LD parameters between the SNPs in a certain region. For details about the format of these files, users can refer to this link: http://www.sph.umich.edu/csg/qin/FESTA/sample_files/

SNP list files:

These files contain lists of loci for the purpose of sub-setting, specifying loci that must be included as tags, or excluding loci from being tags during the analysis process.

'-s' is used to tell TagZilla to read in a subset of all genotyped loci
'-i' is used for the list of obligatorily included loci
'-e' is used for the list of obligatorily excluded loci

These files all share the same simple format: no headers, with one locus name on each line. Locus names are case-sensitive. For example:

rs150379
rs469673
rs212121
rs210499
rs469536

However, if the first character of the argument on any of these options is a colon ':', then the remainder of the argument is processed as a comma-delimited list of loci. For example, -i :rs512331,rs1221. This method is sometimes convenient when running TagZilla iteratively from the command-line.
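The two ways of supplying a locus list (a file with one name per line, or a colon-prefixed comma-delimited string) can be handled by one small helper, sketched below. The function name is hypothetical, and stripping whitespace around names is an assumption.

```python
def load_locus_list(arg):
    """Return locus names from ':a,b,c' or from a file with one name per line."""
    if arg.startswith(":"):
        # Inline form: everything after the colon is a comma-delimited list.
        return [name.strip() for name in arg[1:].split(",") if name.strip()]
    with open(arg) as handle:
        # File form: no header, one case-sensitive locus name per line.
        return [line.strip() for line in handle if line.strip()]


inline_loci = load_locus_list(":rs12355,rs12365,rs12488")
print(inline_loci)
```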

Design score files:

These files contain the design score information for SNPs. Each line of the file must contain the name of the SNP and its design score. TagZilla allows multiple design score files to be specified from the command line, and the information in all files will be considered during the tag selection stage.

If the design score for a SNP is 0 or below the given threshold, that SNP will be forced into the exclude set. If this SNP also happens to be in the include set, then the disposition of the bin containing this SNP will be obligate-include, the SNP will be reported as obligate_tag (because it is in the include set) and also as one of the excluded_as_tags (because it is forced into the exclude set). Therefore, include will take priority over exclude in our program.

Following are some sample lines of a design score file:

rs150379 0.8
rs469673 0.9
rs212121 0.7
rs210499 0.6
rs469536 0.5
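The exclusion rule for design scores (a score of 0, or a score below the threshold, forces the SNP into the exclude set) can be sketched as follows. The function name is hypothetical, and "below" is read here as strictly less than the threshold, which is an assumption.

```python
def read_design_scores(lines, threshold=0.0):
    """Return (scores, excluded) from 'name score' lines.

    excluded holds the SNPs whose score is 0 or strictly below threshold.
    """
    scores = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        scores[fields[0]] = float(fields[1])
    excluded = {name for name, score in scores.items()
                if score == 0 or score < threshold}
    return scores, excluded


score_lines = [
    "rs150379 0.8",
    "rs469673 0.9",
    "rs212121 0.7",
    "rs210499 0.6",
    "rs469536 0.5",
]
scores, excluded = read_design_scores(score_lines, threshold=0.6)
print(sorted(excluded))
```

Under this reading, a threshold of 0.6 excludes only rs469536, while rs210499 (score exactly 0.6) remains eligible.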

There are four different output files, and only one of these files can be directed to standard output, others must be output to the files with names specified in the command line options. The output will contain the following information about each bin chosen by TagZilla:

all possible tags
one recommended tag
the total number of loci in the bin
the summary statistics in tabular format for all the bins
Pair-wise LD statistics for each bin (with an option to also include the inter-bin LD statistics)
Locus information including the MAF, disposition and bin number for each locus.

Bin info file:

The name and location of this file are specified in the '-B' option. The bin number will appear multiple times as we output all the information for that bin. This format is an expanded version of the output produced by the program ldSelect (Carlson et al., 2004). The following table describes the information produced for each bin:

Row number	Description
1	summary line: contains the total number of sites for the bin, the number of tags, the number of non-tags, the number of required tags, the width, and the average MAF for the bin
2	detailed location information: minimum, median, average and maximum location of all the loci in the bin
3	detailed spacing information: minimum, median, average and maximum spacing among all the loci in the bin
4	tag SNPs
5	recommended tag SNP
6	Other SNPs
7	excluded tag SNPs
8	bin disposition (four possible values: 'obligate-include', 'maximal-bin', 'residual', 'obligate-exclude')
9	Number of loci that would have been covered by the bin, note that for obligate include bins only the obligatory tags are considered.

Here is an example bin info file generated by TagZilla:

Bin 1    sites: 9, tags 3, other 6, tags required 1, width 40229, avg. MAF 49.0%
Bin 1    Location: min 119730461, median 119769349, average 119760254, max 119770690
Bin 1    Spacing: min 214, median 1855, average 5028, max 16495
Bin 1    TagSnps: G11-SN-3PS10 G11-SN-3PS11 rs6204
Bin 1    RecommendedTags: rs6204
Bin 1    other_snps: G11-SN-3PS3 G11-SN-3PS7 rs1998182 rs2064902 rs6200 rs6686779
Bin 1    Excluded_as_tags: rs6200
Bin 1    Bin_disposition: maximal-bin
Bin 1    Loci_covered: 9

Bin 2    sites: 9, tags 5, other 4, tags required 1, width 9729, avg. MAF 48.5%
Bin 2    Location: min 35922900, median 35926631, average 35926790, max 35932629
Bin 2    Spacing: min 299, median 919, average 1216, max 4179
Bin 2    TagSnps: G22-SN-E2S28 G22-SN-E2S33 rs69264 rs86582 rs9622573
Bin 2    RecommendedTags: rs69264
Bin 2    other_snps: rs229559 rs229566 rs6413537 rs739040
Bin 2    Excluded_as_tags: rs6413537
Bin 2    Bin_disposition: maximal-bin
Bin 2    Loci_covered: 9

Bin 3    sites: 5, tags 1, other 4, tags required 1, width 39002, avg. MAF 33.7%
Bin 3    Location: min 119734713, median 119770024, average 119757461, max 119773715
Bin 3    Spacing: min 939, median 2966, average 9750, max 32130
Bin 3    TagSnps: rs1812256
Bin 3    RecommendedTags: rs1812256
Bin 3    other_snps: rs10754400 rs4659182 rs6667572 rs7535128
Bin 3    Bin_disposition: maximal-bin
Bin 3    Loci_covered: 5

Bin 4    sites: 3, tags 3, other 0, tags required 1, width 2251, avg. MAF 32.0%
Bin 4    Location: min 35924206, median 35924854, average 35925172, max 35926457
Bin 4    Spacing: min 648, median 1125, average 1125, max 1603
Bin 4    TagSnps: G22-SN-E2S24 rs1861945 rs229565
Bin 4    RecommendedTags: rs1861945
Bin 4    other_snps: 
Bin 4    Bin_disposition: maximal-bin
Bin 4    Loci_covered: 3

Bin 5    sites: 3, tags 3, other 0, tags required 1, width 4452, avg. MAF 12.8%
Bin 5    Location: min 35926946, median 35927521, average 35928621, max 35931398
Bin 5    Spacing: min 575, median 2226, average 2226, max 3877
Bin 5    TagSnps: G22-SN-E1S1 rs2071710 rs229567
Bin 5    RecommendedTags: rs229567
Bin 5    other_snps: 
Bin 5    Bin_disposition: maximal-bin
Bin 5    Loci_covered: 3
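Since every line starts with "Bin N" followed by a "key: value" pair, the file is easy to post-process. The sketch below collects the list-valued and scalar fields per bin; the function name is hypothetical and the regular expression is based only on the example output shown here.

```python
import re

LIST_KEYS = ("TagSnps", "RecommendedTags", "other_snps", "Excluded_as_tags")
LINE_PATTERN = re.compile(r"Bin\s+(\d+)\s+([^:]+):\s*(.*)$")


def parse_bin_info(lines):
    """Return {bin_number: {key: value}} from bin info file lines."""
    bins = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # blank separator lines between bins
        number = int(match.group(1))
        key = match.group(2)
        value = match.group(3).strip()
        record = bins.setdefault(number, {})
        if key in LIST_KEYS:
            record[key] = value.split()  # space-separated locus names
        elif key == "Loci_covered":
            record[key] = int(value)
        else:
            record[key] = value  # summary, location, spacing, disposition
    return bins


bin_lines = [
    "Bin 5    sites: 3, tags 3, other 0, tags required 1, width 4452, avg. MAF 12.8%",
    "Bin 5    TagSnps: G22-SN-E1S1 rs2071710 rs229567",
    "Bin 5    RecommendedTags: rs229567",
    "Bin 5    other_snps: ",
    "Bin 5    Bin_disposition: maximal-bin",
    "Bin 5    Loci_covered: 3",
]
bin_info = parse_bin_info(bin_lines)
print(bin_info[5]["RecommendedTags"], bin_info[5]["Bin_disposition"])
```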

LD data output file:

The name and location of this file are specified in the '-o' option. The first line is the header line. The following table describes each column in the LD data output file:

Column number	Description
1	a sequence number identifying each bin
2	the first locus name of the pair
3	the second locus name of the pair
4	the r-squared value for the pair
5	the d-prime value for the pair
6	Disposition (see the tables below for details)

All possible values for the disposition of each LD pair are summarized in the two tables below. The first table describes different dispositions for the tags paired with themselves, and the second table is for the rest of the LD pairs within each bin.

Tags paired with themselves:

Disposition	Description
obligate-tag	An obligate tag
alternate-tag	A tag in an obligate-include bin, but not the obligate tag
excluded-tag	A tag for a bin that contains all obligatorily excluded loci
candidate-tag	A tag for a non-obligate bin with more than one possible tag
necessary-tag	A tag for a bin that has only one possible tag
lonely-tag	A tag for a bin with no other loci, but which originally covered more loci. These additional loci were removed by previous iterations of the binning algorithm. This disposition is primarily to distinguish these bins from singletons, which intrinsically are in insufficient LD with any other locus.
singleton-tag	A tag that is not in significant LD with any other locus based upon specified LD threshold.

Note: 'recommended' will be appended to the above disposition to indicate that it is also an optimal tag chosen among all the possible tags for a bin by comparing the score and checking certain criteria provided that these options are set from the command line.

Other LD pairs in the bin:

Disposition	Description
tag-tag	LD between tags within a bin
other-tag	LD between a non-tag and a tag
tag-other	LD between a tag and a non-tag
other-other	LD between non-tags within a bin

Note that for residual bins, the dispositions for all LD pairs within each bin will have a 'residual' qualifier appended to them, and for obligate exclude bins, the dispositions for all LD pairs will have an 'excluded' qualifier appended to them. Also, if the user specifies the '-x' option, the 'interbin' qualifier will appear in the disposition column for all residual LD pairs that sit in the bottom part of this output file. The LD pairs are formed based on each individual genotype input file, i.e., TagZilla doesn't look for significant LD among loci in multiple input files. The LD data is presorted by r-squared, and then locus1, and then locus2 for each bin. Following is an example of an LD data output file:

BIN	LNAME1	LNAME2	RSQUARED	DPRIME	DISPOSITION
1	rs150379	rs150379	1	1	obligate-tag
1	rs469673	rs469673	1	1	alternate-tag
1	rs212121	rs212121	1	1	alternate-tag,recommended
1	rs212111	rs212111	1	1	alternate-tag
1	rs210499	rs210499	1	1	alternate-tag
1	rs469536	rs469536	1	1	alternate-tag
1	rs469667	rs469667	1	1	alternate-tag
1	rs210499	rs469667	1	1	tag-tag
1	rs150379	rs469673	1	1	tag-tag
1	rs210534	rs150379	1	1	other-tag
1	rs210534	rs469673	1	1	other-tag
1	rs150379	rs212121	1	1	tag-tag
1	rs212121	rs469673	1	1	tag-tag
1	rs150379	rs469536	1	1	tag-tag
1	rs210534	rs469536	1	1	other-tag
1	rs469673	rs469536	1	1	tag-tag
1	rs210534	rs212121	1	1	other-tag
1	rs212121	rs469536	1	1	tag-tag
1	rs212111	rs212121	0.932	1	tag-tag
1	rs210499	rs212121	0.928	1	tag-tag
1	rs212121	rs469667	0.928	1	tag-tag
1	rs210534	rs469667	0.919	1	other-tag
1	rs210499	rs210534	0.919	1	tag-other
1	rs150379	rs212111	0.919	1	tag-tag
1	rs212111	rs469673	0.919	1	tag-tag
1	rs210499	rs150379	0.914	1	tag-tag
1	rs210499	rs469673	0.914	1	tag-tag
1	rs150379	rs469667	0.914	1	tag-tag
1	rs469673	rs469667	0.914	1	tag-tag
1	rs212111	rs469536	0.914	1	tag-tag
1	rs210499	rs469536	0.907	1	tag-tag
1	rs469536	rs469667	0.907	1	tag-tag
1	rs212111	rs469667	0.865	1	tag-tag
1	rs210499	rs212111	0.865	1	tag-tag
1	rs210534	rs212111	0.859	0.927	other-tag
2	rs4913553	rs4913553	1	1	candidate-tag,recommended
2	rs1827997	rs1827997	1	1	candidate-tag
2	rs4913553	rs1827997	0.897	1	tag-tag
3	rs169757	rs169757	1	1	candidate-tag,recommended
3	rs456706	rs456706	1	1	candidate-tag
3	rs169757	rs456706	1	1	tag-tag
4	rs240446	rs240446	1	1	singleton-tag,recommended
5	rs10439884	rs10439884	1	1	singleton-tag,recommended
6	rs11088417	rs11088417	1	1	singleton-tag,recommended,residual
7	rs12172917	rs12172917	1	1	excluded-tag,recommended,excluded
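A tab-delimited file with this header can be read with Python's csv module. The sketch below pulls out the recommended tag of each bin by looking at the self-paired rows; the function name is hypothetical, and it assumes the header names shown in the example.

```python
import csv
import io


def recommended_tags(handle):
    """Return {bin_number: [locus, ...]} for rows flagged 'recommended'."""
    tags = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        qualifiers = row["DISPOSITION"].split(",")
        # Tags are reported as a locus paired with itself.
        if row["LNAME1"] == row["LNAME2"] and "recommended" in qualifiers:
            tags.setdefault(int(row["BIN"]), []).append(row["LNAME1"])
    return tags


sample = "\n".join([
    "BIN\tLNAME1\tLNAME2\tRSQUARED\tDPRIME\tDISPOSITION",
    "2\trs4913553\trs4913553\t1\t1\tcandidate-tag,recommended",
    "2\trs1827997\trs1827997\t1\t1\tcandidate-tag",
    "2\trs4913553\trs1827997\t0.897\t1\ttag-tag",
    "4\trs240446\trs240446\t1\t1\tsingleton-tag,recommended",
])
result = recommended_tags(io.StringIO(sample))
print(result)
```

For a real run, the StringIO object would be replaced by the open file named in the -o/--output option.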

Locus info data output file:

The file name and location can be specified by the '-O' option on the command line. The first line is the header line. The following table describes each column in the locus info data output file:

Column	Description
1	Locus name
2	Location of the locus
3	MAF(Minor Allele Frequency)
4	Bin number
5	Disposition

There are two possible disposition categories for each locus.

tag category, please refer to the self pair-wise LD disposition table for details.
non tag category, there are two possible dispositions: 'exclude' or 'other'. Note that for a residual bin, the disposition for all loci in that bin will have a 'residual' qualifier, and for an obligate exclude bin, the disposition for all loci in that bin will have an 'excluded' qualifier.

Following is an example of a locus data info output file. The contents in the file are sorted by bin number, and within each bin sorted by tags first and then non tags.

LNAME         LOCATION    MAF           BINNUM      DISPOSITION
rs150379      9978594     0.116         1           obligate-tag
rs469673      10011786    0.116         1           alternate-tag
rs212121      9986010     0.136         1           alternate-tag,recommended
rs212111      9981677     0.15          1           alternate-tag
rs210499      9929079     0.127         1           alternate-tag
rs469536      10016358    0.109         1           alternate-tag
rs469667      10018800    0.127         1           alternate-tag
rs210534      9972502     0.138         1           exclude
rs4913553     9941912     0.375         2           candidate-tag,recommended
rs1827997     9947160     0.35          2           candidate-tag
rs169757      9928594     0.067         3           candidate-tag,recommended
rs456706      10022975    0.067         3           candidate-tag
rs240446      10000969    0.092         4           singleton-tag,recommended
rs10439884    9993822     0.083         5           singleton-tag,recommended
rs11088417    13262512    0.067         6           singleton-tag,recommended,residual
rs12172917    13279705    0.417         7           excluded-tag,recommended,excluded

Bin summary statistics output file:

There are four types of bins: obligate-include, maximal-bin, residual and obligate-exclude.

An obligate-include bin is a bin with an obligatorily included locus.
An obligate-exclude bin is a bin with an obligatorily excluded locus chosen as a tag for the bin.
The rest of the bins fall into the maximal-bin category unless the targeted number of loci or the targeted number of bins has been met, in which case they are called residual bins.
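The bin type can, in principle, be recovered from the dispositions reported in the locus info data output file. The sketch below is an illustration under stated assumptions, not TagZilla's own logic: it assumes that an 'obligate-tag' locus marks an obligate-include bin and that the 'excluded' and 'residual' qualifiers mark obligate-exclude and residual bins, as in the example locus info output shown earlier. The function name `bin_type` is hypothetical.

```python
def bin_type(dispositions):
    """Guess the bin type from the disposition strings of the loci in one bin.

    `dispositions` is a list of strings such as "alternate-tag,recommended".
    The rules are inferred from the example output, not taken from TagZilla.
    """
    qualifiers = set()
    for disposition in dispositions:
        qualifiers.update(disposition.split(","))
    # 'excluded' (bin qualifier) is distinct from 'exclude' (locus disposition)
    if "excluded" in qualifiers:
        return "obligate-exclude"
    if "residual" in qualifiers:
        return "residual"
    if "obligate-tag" in qualifiers:
        return "obligate-include"
    return "maximal-bin"

print(bin_type(["obligate-tag", "alternate-tag,recommended", "exclude"]))
print(bin_type(["candidate-tag,recommended", "candidate-tag"]))
print(bin_type(["singleton-tag,recommended,residual"]))
print(bin_type(["excluded-tag,recommended,excluded"]))
```

Applied to the example locus info file, these rules give one obligate-include bin, four maximal bins, one residual bin and one obligate-exclude bin.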

The output file includes a table summarizing the bin statistics by bin size (a histogram) for each type of bin. The maximum bin size shown as its own row in a table can be configured with the '-H' option. All of the tables in this output file share a common set of columns. The following table describes the content of each column:

Column	Description
Bin size	The number of loci in the bin
Bins	The number of bins with the specified bin size
%	The percentage of bins with the specified bin size
Loci	The number of loci contained in all the bins with the specified bin size
%	The percentage of loci contained in all the bins with the specified bin size
Tags	The number of tags in all the bins with the specified bin size
Non tags	The number of non tags in all the bins with the specified bin size
Avg tags	The average number of tags per bin for all the bins with the specified bin size
Avg width	The average width of all the bins with the specified bin size. The width of a bin is the difference between the maximum location and the minimum location of the loci in the bin.
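To make the column definitions concrete, the sketch below recomputes the size, tag count, non tag count and width for a single bin. It is only an illustration: the (location, disposition) pairs are copied from bin 1 of the example locus info file, the function name `bin_stats` is hypothetical, and it assumes that a locus counts as a tag when the first term of its disposition ends in '-tag'.

```python
def bin_stats(loci):
    """Compute per-bin statistics from (location, disposition) pairs."""
    locations = [location for location, _ in loci]
    # Assumption: tag dispositions start with a term ending in "-tag".
    tags = sum(1 for _, disposition in loci
               if disposition.split(",")[0].endswith("-tag"))
    return {
        "size": len(loci),
        "tags": tags,
        "non_tags": len(loci) - tags,
        # width = maximum location minus minimum location
        "width": max(locations) - min(locations),
    }

bin1 = [
    (9978594, "obligate-tag"),
    (10011786, "alternate-tag"),
    (9986010, "alternate-tag,recommended"),
    (9981677, "alternate-tag"),
    (9929079, "alternate-tag"),
    (10016358, "alternate-tag"),
    (10018800, "alternate-tag"),
    (9972502, "exclude"),
]
print(bin_stats(bin1))
# {'size': 8, 'tags': 7, 'non_tags': 1, 'width': 89721}
```

A bin with a single locus has a width of 0, since its maximum and minimum locations coincide.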

At the end of the file there is a final summary table showing the bin statistics further summarized by bin disposition. The following is an example of the bin summary file:

Bin statistics by bin size for obligate-include:

 bin                                     non-    avg    avg
 size  bins     %   loci      %   tags   tags   tags  width
 ----- ----- ------ ------ ------ ------ ------ ---- ------
   8       1 100.00      8 100.00      7      1  7.0  89721
 Total     1 100.00      8 100.00      7      1  7.0  89721


Bin statistics by bin size for maximal-bin:

 bin                                     non-    avg    avg
 size  bins     %   loci      %   tags   tags   tags  width
 ----- ----- ------ ------ ------ ------ ------ ---- ------
 singl     2  50.00      2  33.33      2      0  1.0      0
   1       0   0.00      0   0.00      0      0  0.0      0
   2       2  50.00      4  66.67      4      0  2.0  49814
 Total     4 100.00      6 100.00      6      0  1.5  24907


Bin statistics by bin size for residual:

 bin                                     non-    avg    avg
 size  bins     %   loci      %   tags   tags   tags  width
 ----- ----- ------ ------ ------ ------ ------ ---- ------
 singl     1 100.00      1 100.00      1      0  1.0      0
 Total     1 100.00      1 100.00      1      0  1.0      0


Bin statistics by bin size for obligate-exclude:

 bin                                     non-    avg    avg
 size  bins     %   loci      %   tags   tags   tags  width
 ----- ----- ------ ------ ------ ------ ------ ---- ------
 singl     1 100.00      1 100.00      1      0  1.0      0
 Total     1 100.00      1 100.00      1      0  1.0      0


Bin statistics by disposition:
                                                        non-    avg    avg
 disposition          bins     %   loci      %   tags   tags   tags  width
 -------------------- ----- ------ ------ ------ ------ ------ ---- ------
 obligate-include         1  14.29      8  50.00      7      1  7.0  89721
 maximal-bin              4  57.14      6  37.50      6      0  1.5  24907
 residual                 1  14.29      1   6.25      1      0  1.0      0
 obligate-exclude         1  14.29      1   6.25      1      0  1.0      0
               Total      7 100.00     16 100.00     15      1  2.1  27050
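The percentage and average columns in the final table can be reproduced from the raw counts. The following sketch reflects an assumption about how the figures relate (share of all bins, share of all loci, and tags divided by bins), checked only against the example above. The helper name `summarize` is hypothetical.

```python
# Counts copied from the "Bin statistics by disposition" example:
# disposition -> (bins, loci, tags)
counts = {
    "obligate-include": (1, 8, 7),
    "maximal-bin": (4, 6, 6),
    "residual": (1, 1, 1),
    "obligate-exclude": (1, 1, 1),
}

def summarize(counts):
    """Return (% of bins, % of loci, average tags per bin) per disposition."""
    total_bins = sum(bins for bins, _, _ in counts.values())
    total_loci = sum(loci for _, loci, _ in counts.values())
    rows = {}
    for name, (bins, loci, tags) in counts.items():
        rows[name] = (
            round(100.0 * bins / total_bins, 2),
            round(100.0 * loci / total_loci, 2),
            round(float(tags) / bins, 1),
        )
    return rows

for name, (pct_bins, pct_loci, avg_tags) in summarize(counts).items():
    print("%-18s %6.2f %6.2f %4.1f" % (name, pct_bins, pct_loci, avg_tags))
```

For the maximal-bin row this gives 57.14, 37.50 and 1.5, in agreement with the example table.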

TagZilla Package

The package contains the following:

Python source files (.py files)
C files, which provide optional accelerated implementations of some TagZilla functions (.c files)
PYD files, which are the dynamic libraries generated on the Windows platform from the provided C files (.pyd files)
Sample input/output files

Users need to download and install Python for the desired platform from its official website, http://www.python.org/. TagZilla version 1.0 requires Python 2.4 or later and runs on different OS platforms, including Unix/Linux and Windows. UNIX users need to run 'python setup.py install' to install TagZilla and build the accelerators, or 'python setup.py build_ext -i' if they do not have root access. Windows users receive precompiled binary accelerators with the package; those libraries need to be placed in the site-packages directory of the Python installation.

NOTE:

Users should expect significantly increased computation times when TagZilla is run without native code accelerators. A warning will be printed when accelerators are not available or properly installed. Please do not post comparative timing or benchmark results when running TagZilla in this manner.

TagZilla License

The software subject to this notice and license includes both human readable source code form and machine readable, binary, object code form ("the TagZilla Software"). The TagZilla Software was developed in conjunction with the National Cancer Institute ("NCI") by NCI employees and employees or contractors of SAIC. To the extent government employees are authors, any rights in such works shall be subject to Title 17 of the United States Code, section 105.

This TagZilla Software License (the "License") is between NCI and You. "You" (or "Your") shall mean a person or an entity, and all other entities that control, are controlled by, or are under common control with the entity. "Control" for purposes of this definition means (i) the direct or indirect power to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

This License is granted provided that You agree to the conditions described below. NCI grants You a non-exclusive, worldwide, perpetual, fully-paid-up, no-charge, irrevocable, transferable and royalty-free right and license in its rights in the TagZilla Software to (i) use, install, access, operate, execute, copy, modify, translate, market, publicly display, publicly perform, and prepare derivative works of the TagZilla Software; (ii) distribute and have distributed to and by third parties the TagZilla Software and any modifications and derivative works thereof; and (iii) sublicense the foregoing rights set out in (i) and (ii) to third parties, including the right to license such rights to further third parties. For sake of clarity, and not by way of limitation, NCI shall have no right of accounting or right of payment from You or Your sublicensees for the rights granted under this License. This License is granted at no charge to You.

1. Your redistributions of the source code for the Software must retain the above copyright notice, this list of conditions and the disclaimer and limitation of liability of Article 6, below. Your redistributions in object code form must reproduce the above copyright notice, this list of conditions and the disclaimer of Article 6 in the documentation and/or other materials provided with the distribution, if any.

2. Your end-user documentation included with the redistribution, if any, must include the following acknowledgment: "This product includes software developed by SAIC and the National Cancer Institute." If You do not include such end-user documentation, You shall include this acknowledgment in the Software itself, wherever such third-party acknowledgments normally appear.

3. You may not use the names "The National Cancer Institute", "NCI" "Science Applications International Corporation" and "SAIC" to endorse or promote products derived from this Software. This License does not authorize You to use any trademarks, service marks, trade names, logos or product names of either NCI or SAIC, except as required to comply with the terms of this License.

4. For sake of clarity, and not by way of limitation, You may incorporate this Software into Your proprietary programs and into any third party proprietary programs. However, if You incorporate the Software into third party proprietary programs, You agree that You are solely responsible for obtaining any permission from such third parties required to incorporate the Software into such third party proprietary programs and for informing Your sublicensees, including without limitation Your end-users, of their obligation to secure any required permissions from such third parties before incorporating the Software into such third party proprietary software programs. In the event that You fail to obtain such permissions, You agree to indemnify NCI for any claims against NCI by such third parties, except to the extent prohibited by law, resulting from Your failure to obtain such permissions.

5. For sake of clarity, and not by way of limitation, You may add Your own copyright statement to Your modifications and to the derivative works, and You may provide additional or different license terms and conditions in Your sublicenses of modifications of the Software, or any derivative works of the Software as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

6. THIS SOFTWARE IS PROVIDED "AS IS," AND ANY EXPRESSED OR IMPLIED WARRANTIES, (INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE) ARE DISCLAIMED. IN NO EVENT SHALL THE NATIONAL CANCER INSTITUTE, SAIC, OR THEIR AFFILIATES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

References

Carlson C.S. et al. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am. J. Hum. Genet. 74, 106-120.
Wigginton J.E. et al. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 76, 887-893.
Qin Z.S. et al. (2006) An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria. Bioinformatics 22, 220-225.

Contact information

In case of any questions regarding this documentation or the TagZilla program, please contact Kevin Jacobs by email at jacobske@mail.nih.gov.