Friday October 12, 11 AM

Bldg. 38A, B2 Conference Room

 

Constructing Near-Perfect Phylogenies 

Amar Mukherjee 

School of Electrical Engg. and Computer Science 

University of Central Florida, Orlando, FL. 32816 

Email: amar@cs.ucf.edu 

A Single Nucleotide Polymorphisms or a SNP is a variation in the DNA sequence of a genome when a single nucleotide differs between 
the members of a species or between paired chromosomes in an individual. Almost all SNPs are bi-allelic. The minor allele frequency
 varies between population groups based on their ethnicity and geographical locations. Variations in the DNA sequence in humans can 
not only affect the phenotypes like color of eye or of hair or whether the earwax will be brown and sticky wet, it can also affect
 human behavior, mental and other inherited diseases. 

It is estimated that there are approximately ten million SNPs in the human genome. These SNPS are not completely independent of 
each other – blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs 
on a block of the chromosome is called a haplotype.. The cost associated with empirically collecting haplotype data is prohibitively 
expensive.1 Therefore, the un-ordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles
 in each SNP locus in an individual, but does not give information about which allele is on which copy of the chromosome. This 
necessitates computational techniques for inferring haplotypes from genotype data. This computational problem is called as the 
haplotype inference problem. 

The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes with the condition 
that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human 
genome due to the block structure of the human genome. An optimal O(nm) time algorithm for the PPH problem, where n is the number 
of genotypes and m is the number of SNPs, has recently been published [R. Vijaya Satya and A. Mukherjee, JCB, May 2005]. 


In this talk, the problem of constructing near-perfect phylogenies on bi-allelic haplotypes will be addressed. We present an efficient 
algorithm to determine if a given set of genotypes admit a phylogeny with a single homoplasy or back mutation event. The time complexity
 of our algorithm is O(m2(n+m)) which is a significant improvement over the earlier O(n4) algorithm. We also consider a generalization 
of the problem allowing q homoplasy events to take place at the same site taking O(mq+1(n+m)) time. Empirical analysis on simulated 
data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 
1000 times faster than phase