Phylogenetic Motif Detection by EMnEM

version 1.0

Copyright 2003 by Alan Moses.  This software is provided "as is" without
warranty of any kind.  The author assumes no responsibility for the results it
produces or conclusions based thereupon.  It is distributed free of charge for
academic use only.  Permission to copy and use it is granted free of charge
provided that no fee is charged and this copyright notice is not removed.

Suggested citation: Moses AM, Chiang DY, Eisen MB, Pac Symp Biocomput. 2004

Contents

1. Installing EMnEM on Linux
2. Getting started with EMnEM
3. The evm.ctl control file
4. The multi-fasta alignment file
5. The tree file


1. Installing EMnEM on Linux

To install, first copy the archive EMnEM.tar to a directory of your choice.
Then type the following commands:

tar -xvf EMnEM.tar
make install

The following files should appear in the directory:

EMnEM (the executable)
evm.ctl (the control file)
evomix1.0.cpp (the source code)
Makefile
clustal2fasta.pl (a perl script to convert clustalw output to multi-fasta format)
all.tree (an example tree file)
example (a directory with example files)
EMnEMdoc.txt (this file)

You should now be able to run the program by typing:

./EMnEM


2. Getting started with EMnEM

EMnEM needs to read several files to work properly.
First, it reads a control file called evm.ctl, which must be in the same
directory as the executable.  This file contains all of the parameters.
Second, EMnEM reads one or more DNA sequence alignments in multi-fasta format.
Third, if an alignment contains more than 2 species (that is, it is more than
a pairwise alignment), EMnEM also needs a phylogenetic tree that relates
the sequences.  Trees are not needed for 2 species (pairwise) alignments.
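
To check in advance whether an alignment will need a tree, you can count the
sequences in the multi-fasta file.  The following Python sketch is not part of
the EMnEM distribution; the function names and the example sequences are made
up for illustration, and it assumes that every sequence starts with a ">"
header line.

```python
def count_fasta_records(text):
    """Count sequences in multi-fasta text: one '>' header line per sequence."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def needs_tree(text):
    """A tree is needed only when the alignment has more than two sequences."""
    return count_fasta_records(text) > 2


# Hypothetical alignments, used only to exercise the functions.
pairwise = ">Scer\nACGT-A\n>Spar\nACGTTA\n"
multiple = pairwise + ">Smik\nACGTCA\n"
print(needs_tree(pairwise))  # False
print(needs_tree(multiple))  # True
```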

3. The evm.ctl control file

EMnEM uses a control file in much the same way as the programs in the
PAML package (Yang, Z. 1997. PAML: a program package for phylogenetic
analysis by maximum likelihood. CABIOS 13:555-556). The control file is
a plain text file and can be modified in any text editor.


Control file for EMnEM
(commands begin with a dash, so don't use any dashes in comments)
 
motif finding parameters

- i 100         number of iterations. If the likelihood is still increasing after 100 iterations,
		try increasing this.
- w 17          width of motif.  If no matrix or consensus is provided, EMnEM will try
		all the words of length w in the dataset.
 c NNNNNNNNN    start the EM with this consensus. Takes N's but no other IUPAC symbols.
 x example/gal4sites.matrix     initial matrix file.  Start the EM with this matrix.  
- e 2.0         expected number of instances of each motif per sequence.  Set higher if you 
		have a few very long sequences
- t 0.9         posterior prob. threshold for motif instances.  This is the stringency 
		for what gets called a motif.  If no instances are reported, try lowering it. 
- b 0           1 to look on both strands, 0 to look on one
- n 1 		number of motifs to look for.  EMnEM will remove the instances of the prior 
		motifs and then search again.  The instances in the output file, and final 
		likelihood, however, are computed using the entire data, so instances may 
		overlap.  Furthermore, the order that EMnEM finds motifs is determined by 
		the choice of initialization (e.g., consensus vs. all words of length w) so 
		the best motif is the one that has the highest likelihood as reported in the 
		EMnEMscores output file.
 h              turn off heuristics (only use if the likelihood is decreasing a lot)

tree files

- s 0           2=pairwise alignments (no trees required)
                1=use the tree found in all.tree for all the alignments,
                0=use the tree found in name.tree for each alignment name.fasta

initial rate model for the motif
(ratio of the rate for the motif to background at each position.)

- p 0.5         relative rate at position 1
- p 0.5         relative rate at position 2
- p 0.5         relative rate at position 3
- p 0.5         relative rate at position 4
		as many as w positions can be specified. If fewer than w are 
		specified, the rest are set to 1.0 (background rate)

What to do

- u 1           0 = keep starting matrix (to, for example, calculate 
		the likelihood for an existing matrix) 
                1 = update matrix (motif finding)

- r 0           0=keep starting rates (as specified by the p commands)
                1=estimate (ML) JC rate for the motif (1 parameter) 		
                2=estimate (ML) JC rate at each position (w parameters) 
		ML rates are only available for pairwise alignments
                3=Halperin-Bruno selection

Evolutionary model

- m 0           0=Jukes Cantor  
                1=F81

display

- d 1           0=nothing
                1=print to screen

output

- o 1           print EMnEMscores file.  This file contains the matrix, the
		predicted instances of the motif and the likelihood.
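
The layout of the control file can be illustrated with a short Python sketch
that reads the options back out of it.  This is not EMnEM's own parser.  It
assumes that a command line starts with a dash followed by a one-letter option
and a single value, that h is the only option without a value, and that every
other line is a comment.  EMnEM itself may be less strict about where a dash
can appear.  The function names are made up for illustration.

```python
def parse_control(text, flag_options=("h",)):
    """Collect options from control-file text.

    Assumption: a command line is a dash, a one-letter option name and
    (except for flag options) one value token; the rest is a comment.
    """
    options = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0] != "-":
            continue  # comment, continuation line, or disabled command
        name = tokens[1]
        if name in flag_options:
            options.setdefault(name, []).append(None)
        elif len(tokens) >= 3:
            options.setdefault(name, []).append(tokens[2])
    return options


def motif_rates(options):
    """Relative rate per motif position, padded to width w with 1.0."""
    width = int(options["w"][-1])
    rates = [float(v) for v in options.get("p", [])][:width]
    return rates + [1.0] * (width - len(rates))


sample = """\
- w 6          width of motif
 c NNNNNN      disabled because there is no leading dash
- p 0.5        relative rate at position 1
- p 0.5        relative rate at position 2
"""
opts = parse_control(sample)
print(opts["w"])          # ['6']
print(motif_rates(opts))  # [0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
```

Repeated options such as p are kept in the order they appear, which is what
lets the rates be assigned to positions 1, 2, 3 and so on.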


4. The multi-fasta alignment file

If EMnEM finds the evm.ctl control file, it will look for a multi-fasta
alignment file on the command line.  For example, 

./EMnEM example/YBR018C_upstream.aln.fasta

runs the algorithm on the alignment found in the file example/YBR018C_upstream.aln.fasta

./EMnEM example/*aln.fasta 

runs the algorithm on all the alignment files in that directory simultaneously.

Included in the EMnEM distribution is a Perl script called clustal2fasta.pl that 
converts ClustalW format alignments to multi-fasta.

perl clustal2fasta.pl alignment.aln

will produce a file called alignment.aln.fasta containing the alignment in multi-fasta 
format.
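
For readers who want to see what such a conversion involves, here is a minimal
Python sketch.  It is not the bundled clustal2fasta.pl and may differ from it
in details.  It assumes the usual ClustalW layout: a header line beginning with
CLUSTAL, then blocks of "name sequence" lines, with conservation lines that
start with whitespace.  The sequence names and the version string in the sample
are made up.

```python
def clustal_to_fasta(text):
    """Convert ClustalW-format alignment text to multi-fasta text."""
    order, chunks = [], {}
    for line in text.splitlines():
        # Skip blank lines, conservation lines (leading whitespace)
        # and the CLUSTAL header line.
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        if name not in chunks:
            order.append(name)
            chunks[name] = []
        chunks[name].append(chunk)
    # Join the pieces of each sequence, keeping the original sequence order.
    return "".join(">%s\n%s\n" % (name, "".join(chunks[name])) for name in order)


sample = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seqA      ACGT-A",
    "seqB      ACGTTA",
    "          **** *",
    "",
    "seqA      GG",
    "seqB      GC",
    "          *",
])
print(clustal_to_fasta(sample), end="")
# >seqA
# ACGT-AGG
# >seqB
# ACGTTAGC
```

Keeping the sequences in their original order matters because the leaf numbers
in the tree file refer to that order.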


5. The tree file

If EMnEM finds a fasta alignment file with more than 2 sequences, it will look for a tree file
that describes their phylogenetic relationship.  If the -s option is set to 1, EMnEM will use the
tree found in the file all.tree for all the alignments.  This file should be in the same 
directory as the executable.  Alternatively, if -s is set to 0, a tree file with the same name
as the fasta file, but with the extension .tree instead of .fasta, is required.  This file should 
be in the same directory as the fasta file.  The tree file is structured as follows:

t 7
r 6     n 5     l 0
n 5     n 4     l 2
n 4     l 1     l 3
b
4       0.09787 0.13253
5       0.12665 0.23934
6       0.154675        0.154675


The first section of the file, beginning with t, specifies the tree topology. t is followed by
the total number of nodes in the tree. r stands for the root, n for interior nodes and l for 
leaves. The line 

r 6     n 5     l 0

means that the root is node 6 and it joins the internal node 5 with the leaf 0.  The numbers of
the leaves MUST correspond to the order of the sequences in the corresponding multi-fasta 
alignment file, 0 being the first sequence, 1 the next and so on.  The numbers of the other nodes
must be self-consistent, but are arbitrary.

The second section of the tree file starts with b and lists the branch lengths for the 
interior nodes.  The line 

4       0.09787 0.13253

means that the left branch of node 4 (to leaf 1) is of length 0.09787 substitutions per site, and
the right branch (to leaf 3) is of length 0.13253.  Phylogenetic trees can be computed by many 
freely available programs (such as those in the PAML package).
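
To make the format concrete, the following Python sketch reads the example
tree above and prints it in the widely used Newick notation, with the leaves
named by their numbers.  It is an illustration only and is not part of EMnEM.
The function names are made up, and the sketch assumes that the first branch
length on a line belongs to the first child listed for that node, as in the
description of node 4 above.

```python
def parse_tree(text):
    """Parse tree-file text into (node_count, root, children, lengths)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    node_count = int(lines[0][1])   # first line, e.g. "t 7"
    split = lines.index(["b"])      # start of the branch-length section
    root, children, lengths = None, {}, {}
    for kind, node, _, left, _, right in lines[1:split]:
        children[int(node)] = (int(left), int(right))
        if kind == "r":
            root = int(node)
    for node, left_len, right_len in lines[split + 1:]:
        lengths[int(node)] = (float(left_len), float(right_len))
    return node_count, root, children, lengths


def to_newick(node, children, lengths):
    """Render the subtree below node in Newick form."""
    if node not in children:
        return str(node)  # a leaf
    parts = [
        "%s:%g" % (to_newick(child, children, lengths), length)
        for child, length in zip(children[node], lengths[node])
    ]
    return "(" + ",".join(parts) + ")"


example = """\
t 7
r 6     n 5     l 0
n 5     n 4     l 2
n 4     l 1     l 3
b
4       0.09787 0.13253
5       0.12665 0.23934
6       0.154675        0.154675
"""
node_count, root, children, lengths = parse_tree(example)
newick = to_newick(root, children, lengths) + ";"
print(newick)
# (((1:0.09787,3:0.13253):0.12665,2:0.23934):0.154675,0:0.154675);
```

A quick consistency check is that the number after t equals the number of
leaves plus the number of lines in the topology section (here 4 + 3 = 7).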