Friday October 12, 11 AM Bldg. 38A, B2 Conference Room Constructing Near-Perfect Phylogenies Amar Mukherjee School of Electrical Engg. and Computer Science University of Central Florida, Orlando, FL. 32816 Email: amar@cs.ucf.edu A Single Nucleotide Polymorphisms or a SNP is a variation in the DNA sequence of a genome when a single nucleotide differs between the members of a species or between paired chromosomes in an individual. Almost all SNPs are bi-allelic. The minor allele frequency varies between population groups based on their ethnicity and geographical locations. Variations in the DNA sequence in humans can not only affect the phenotypes like color of eye or of hair or whether the earwax will be brown and sticky wet, it can also affect human behavior, mental and other inherited diseases. It is estimated that there are approximately ten million SNPs in the human genome. These SNPS are not completely independent of each other – blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs on a block of the chromosome is called a haplotype.. The cost associated with empirically collecting haplotype data is prohibitively expensive.1 Therefore, the un-ordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles in each SNP locus in an individual, but does not give information about which allele is on which copy of the chromosome. This necessitates computational techniques for inferring haplotypes from genotype data. This computational problem is called as the haplotype inference problem. The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes with the condition that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human genome due to the block structure of the human genome. An optimal O(nm) time algorithm for the PPH problem, where n is the number of genotypes and m is the number of SNPs, has recently been published [R. Vijaya Satya and A. Mukherjee, JCB, May 2005]. In this talk, the problem of constructing near-perfect phylogenies on bi-allelic haplotypes will be addressed. We present an efficient algorithm to determine if a given set of genotypes admit a phylogeny with a single homoplasy or back mutation event. The time complexity of our algorithm is O(m2(n+m)) which is a significant improvement over the earlier O(n4) algorithm. We also consider a generalization of the problem allowing q homoplasy events to take place at the same site taking O(mq+1(n+m)) time. Empirical analysis on simulated data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 1000 times faster than phase