Phylogenetic Motif Detection by EMnEM version 1.0

Copyright 2003 by Alan Moses.

This software is provided "as is" without warranty of any kind. The author
assumes no responsibility for the results it produces or conclusions based
thereupon. It is distributed free of charge for academic use only. Permission
to copy and use it is granted free of charge provided that no fee is charged
and this copyright notice is not removed.

Suggested citation: Moses AM, Chiang DY, Eisen MB, Pac Symp Biocomput. 2004

Contents

1. Installing EMnEM on Linux
2. Getting started with EMnEM
3. The evm.ctl control file
4. The multi-fasta alignment file
5. The tree file


1. Installing EMnEM on Linux

To install, first copy the archive EMnEM.tar to a directory of your choice.
Then type the following commands:

    tar -xvf EMnEM.tar
    make install

The following files should appear in the directory:

    EMnEM              (the executable)
    evm.ctl            (the control file)
    evomix1.0.cpp      (the source code)
    Makefile
    clustal2fasta.pl   (a Perl script to convert ClustalW output to
                       multi-fasta format)
    all.tree           (an example tree file)
    example            (a directory with example files)
    EMnEMdoc.txt       (this file)

You should now be able to run the program by typing

    ./EMnEM


2. Getting started with EMnEM

EMnEM needs to read several files to work properly. First, it reads a control
file called evm.ctl, which must be in the same directory as the executable.
This file contains all of the parameters. Second, EMnEM reads DNA sequence
alignment(s) in multi-fasta format. Third, if the sequence alignment contains
more than 2 species (that is, it is more than a pairwise alignment), EMnEM
also needs a phylogenetic tree that relates the sequences. Trees are not
needed for 2 species (pairwise) alignments.


3. The evm.ctl control file

EMnEM uses a control file in much the same way as the programs in the PAML
package (Yang, Z. 1997. PAML: a program package for phylogenetic analysis by
maximum likelihood. CABIOS 13:555-556). The control file is a plain text file
and can be modified in any text editor.

Control file for EMnEM (commands begin with a dash, so don't use any dashes in comments)

motif finding parameters

- i 100    number of iterations. If the likelihood is still increasing
           after 100 iterations, try increasing this.
- w 17     width of motif. If no matrix or consensus is provided, EMnEM
           will try all the words of length w in the dataset.
c NNNNNNNNN
           start the EM with this consensus. Takes N's but no other
           IUPAC symbols.
x example/gal4sites.matrix
           initial matrix file. Start the EM with this matrix.
- e 2.0    expected number of instances of each motif per sequence.
           Set higher if you have a few very long sequences.
- t 0.9    posterior prob. threshold for motif instances. This is the
           stringency for what gets called a motif. If no instances are
           reported, try lowering it.
- b 0      1 to look on both strands, 0 to look on one
- n 1      number of motifs to look for. EMnEM will remove the instances
           of the prior motifs and then search again. The instances in
           the output file, and the final likelihood, however, are
           computed using the entire data, so instances may overlap.
           Furthermore, the order in which EMnEM finds motifs is
           determined by the choice of initialization (e.g., consensus
           vs. all words of length w), so the best motif is the one that
           has the highest likelihood as reported in the EMnEMscores
           output file.
h          turn off heuristics (only use if the likelihood is decreasing
           a lot)

tree files

- s 0      2 = pairwise alignments (no trees required)
           1 = use the tree found in all.tree for all the alignments
           0 = use the tree found in name.tree for each alignment
               name.fasta

initial rate model for the motif (ratio of the rate for the motif to
background at each position)

- p 0.5    relative rate at position 1
- p 0.5    relative rate at position 2
- p 0.5    relative rate at position 3
- p 0.5    relative rate at position 4

as many as w positions can be specified.
If fewer than w are specified, the rest are set to 1.0 (background rate)

What to do

- u 1      0 = keep starting matrix (to, for example, calculate the
               likelihood for an existing matrix)
           1 = update matrix (motif finding)
- r 0      0 = keep starting rates (as specified by the p commands)
           1 = estimate (ML) JC rate for the motif (1 parameter)
           2 = estimate (ML) JC rate at each position (w parameters)
               ML rates are only available for pairwise alignments
           3 = Halpern-Bruno selection

Evolutionary model

- m 0      0 = Jukes Cantor
           1 = F81

display

- d 1      0 = nothing
           1 = print to screen

output

- o 1      print EMnEMscores file. This file contains the matrix, the
           predicted instances of the motif and the likelihood.


4. The multi-fasta alignment file

If EMnEM finds the evm.ctl control file, it will look for a multi-fasta
alignment file on the command line. For example,

    ./EMnEM example/YBR018C_upstream.aln.fasta

runs the algorithm on the alignment found in the file
example/YBR018C_upstream.aln.fasta, and

    ./EMnEM example/*aln.fasta

runs the algorithm on all the alignment files in that directory
simultaneously.

Included in the EMnEM distribution is a Perl script called clustal2fasta.pl
that converts ClustalW format alignments to multi-fasta.

    perl clustal2fasta.pl alignment.aln

will produce a file called alignment.aln.fasta containing the alignment in
multi-fasta format.


5. The tree file

If EMnEM finds a fasta alignment file with more than 2 sequences, it will look
for a tree file that describes their phylogenetic relationship. If the -s
option is set to 1, EMnEM will use the tree found in the file all.tree for all
the sequences. This file should be in the same directory as the executable.
Alternatively, if -s is set to 0, a tree file with the same name as the fasta
file, but with the extension .tree instead of .fasta, is required. This file
should be in the same directory as the fasta file.
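
The lookup rule above can be summarized in a short Python sketch. This is an
illustration only and is not part of the EMnEM distribution: the function name
tree_file_for and its arguments are made up for this example, and it assumes
that the alignment file name ends in .fasta.

```python
from pathlib import Path


def tree_file_for(alignment, s, exe_dir="."):
    """Return the tree file expected for one alignment, or None.

    alignment: path to a multi-fasta alignment (assumed to end in .fasta)
    s:         value of the s command in evm.ctl (0, 1 or 2)
    exe_dir:   directory that holds the EMnEM executable (hypothetical
               argument, used only to locate all.tree)
    """
    if s == 2:
        # Pairwise alignments: no tree is required.
        return None
    if s == 1:
        # One shared tree, kept next to the executable.
        return Path(exe_dir) / "all.tree"
    if s == 0:
        # One tree per alignment: same name, .tree instead of .fasta.
        path = Path(alignment)
        if path.suffix != ".fasta":
            raise ValueError("expected a file name ending in .fasta")
        return path.with_suffix(".tree")
    raise ValueError("s must be 0, 1 or 2")


if __name__ == "__main__":
    print(tree_file_for("example/YBR018C_upstream.aln.fasta", 0))
```

With -s set to 0, the example alignment example/YBR018C_upstream.aln.fasta
would therefore need a tree file named example/YBR018C_upstream.aln.tree.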

The tree file is structured as follows:

    t 7
    r 6 n 5 l 0
    n 5 n 4 l 2
    n 4 l 1 l 3
    b
    4 0.09787 0.13253
    5 0.12665 0.23934
    6 0.154675 0.154675

The first section of the file, beginning with t, specifies the tree topology.
t is followed by the total number of nodes in the tree. r stands for the root,
n for interior nodes and l for leaves. The line

    r 6 n 5 l 0

means that the root is node 6 and it joins the internal node 5 with the leaf
0. The numbers of the leaves MUST correspond to the order of the sequences in
the corresponding multi-fasta alignment file, 0 being the first sequence, 1
the next and so on. The numbers of the other nodes must be self-consistent,
but are otherwise arbitrary.

The second section of the tree file starts with b and lists the branch lengths
for the root and the interior nodes. The line

    4 0.09787 0.13253

means that the left branch of node 4 (to leaf 1) is of length 0.09787
substitutions per site, and the right branch (to leaf 3) is of length 0.13253.

Phylogenetic trees can be computed by many freely available programs (such as
the PAML package).
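
To make the format concrete, here is a minimal Python sketch that reads a tree
file and writes the same tree in Newick (parenthesized) notation. It is not
part of the EMnEM distribution, and the names parse_tree and to_newick are
hypothetical. It splits the file on whitespace, so it does not depend on how
the entries are broken into lines, and it assumes that the first child listed
for a node is the one reached by the left branch, as in the node 4 example
above.

```python
EXAMPLE = """
t 7
r 6 n 5 l 0
n 5 n 4 l 2
n 4 l 1 l 3
b
4 0.09787 0.13253
5 0.12665 0.23934
6 0.154675 0.154675
"""


def parse_tree(text):
    """Parse an EMnEM style tree file into (n_nodes, root, children, branch).

    children maps a parent node to its (left, right) child nodes.
    branch maps a parent node to its (left, right) branch lengths.
    """
    tokens = text.split()
    if len(tokens) < 2 or tokens[0] != "t" or "b" not in tokens:
        raise ValueError("expected a 't' section followed by a 'b' section")
    n_nodes = int(tokens[1])
    b_at = tokens.index("b")
    topology, lengths = tokens[2:b_at], tokens[b_at + 1:]
    if len(topology) % 6 != 0 or len(lengths) % 3 != 0:
        raise ValueError("incomplete topology or branch length entry")

    root, children = None, {}
    for i in range(0, len(topology), 6):
        # Each entry reads: kind parent kind child kind child
        kind, parent, _, left, _, right = topology[i:i + 6]
        children[int(parent)] = (int(left), int(right))
        if kind == "r":
            root = int(parent)

    branch = {}
    for i in range(0, len(lengths), 3):
        # Each entry reads: node left_length right_length
        node = int(lengths[i])
        branch[node] = (float(lengths[i + 1]), float(lengths[i + 2]))

    nodes = set(children)
    for pair in children.values():
        nodes.update(pair)
    if root is None or len(nodes) != n_nodes or set(branch) != set(children):
        raise ValueError("topology and branch lengths are not consistent")
    return n_nodes, root, children, branch


def to_newick(root, children, branch):
    """Write the tree in Newick notation, labelling leaves by their number."""
    def walk(node):
        if node not in children:  # a leaf
            return str(node)
        left, right = children[node]
        len_left, len_right = branch[node]
        return f"({walk(left)}:{len_left},{walk(right)}:{len_right})"
    return walk(root) + ";"


if __name__ == "__main__":
    n_nodes, root, children, branch = parse_tree(EXAMPLE)
    print(to_newick(root, children, branch))
```

The leaf labels in the Newick string are the leaf numbers, that is, the
positions of the sequences in the multi-fasta alignment file, counting from 0.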