Monkey (doc file updated May 2006) version 2.0 copyright 2004-2006 by Alan Moses. This software is provided "as is" without warranty of any kind. The author assumes no responsibility for the results it produces or conclusions based thereupon. It is distributed free of charge for academic use only. Permission to copy and use it is granted free of charge provided that no fee is charged and this copyright notice is not removed. Contents I Introduction II Getting started with monkey III The matrix file IV The alignment file V The tree file VI The .occ output file VII More options VIII rMonkey I introduction This program is designed to search aligned non-coding DNA for matches to a known motif matrix. It runs on multiple alignments (2 or more sequences) of non-coding DNA and provides a variety of information of about the putative motif instances that it encounters. The basic operation of the program is to scan a single 'reference' genome (which can be specified by the user) ignoring gaps. For all regions of the 'reference' genome that may be instances of the motif, the corresponding region of the alignment is examined using a model of motif evolution. The sequences from each genome corresponding to this region are reported, along with the various scores and p-values that have been computed. Monkey uses a statistical model to report a pvalue that is an estimate of the probability of the observed match in random genomic alignments. II Getting started with monkey To install monkey on linux type tar -xvf monkey2.0.tar this will create a directory called monkey2.0 . Change to that directory, and compile the program. $ cd monkey2.0 $ make all Monkey can now be run by typing ./Monkey Monkey requires two command line arguments: the names of the matrix file and the fasta alignment file. The formats for these will be discussed below. The simplest run of monkey might look like $ ./Monkey gal4sites.matrix example/YBR018C_upstream.aln.fasta which scans the alignment example/YBR018C_upstream.aln.fasta for sequences matching the matrix in gal4sites.matrix. The results should appear in a new file called example/YBR018C_upstream.aln.occ. Monkey can be run on windows under cygwin (http://www.cygwin.com/). Once the make and g++ compiler have been installed, just follow the instructions for linux. III The matrix file Monkey must read in the frequency matrix representing the sequence specificty of the motif. The format for this 'matrix' file is simply four columns of numbers, with each row representing the frequencies of A, C, G and T for one position in the motif. Frequencies should be greater than 0 and sum to one. An example matrix file is provided with the distribution. IV The alignment file Monkey reads in a 'multiple fasta alignment.' The format for file is as a fasta file, with gapped positions represented by '-' and nucleotides represented as A, C, G and T. Other characters, such as 'a','c','g','t' or 'N' may be included, but columns containing these will be ignored. Examples of these files are included with the distribution and the Monkey distribution includes a perl script (clustal2fasta.pl) to convert clustalw format files to multiple fasta. Typing $ perl scripts/clustal2fasta.pl filename.aln will create a multiple fasta file called filename.aln.fasta. V The tree file If the fasta alignment file contains more than 2 sequences a tree file is required to specify the tree topology and branchlengths. This can be estimated using one of many availble phylogentic analysis packages (such as paml). The tree file is structured as follows t 7 r 6 n 5 l 0 n 5 n 4 l 2 n 4 l 1 l 3 b 4 0.09787 0.13253 5 0.12665 0.23934 6 0.154675 0.154675 The first section of the file, beginning with t, specifies the tree topology. t is followed by the total number of nodes in the tree. r stands for the root, n for interior nodes and l for leaves. The line r 6 n 5 l 0 means that the root is node 6 and it joins the internal node 5 with the leaf 0. The numbers of the leaves MUST correspond to the order of the sequences in the corresponding multi-fasta alignment file, 0 being the first sequence, 1 the next and so on. The numbers of the other nodes must be self-consistent, but are arbitrary. The second second section of the tree file starts with b and lists the branch lengths for the interior nodes. The line 4 0.09787 0.13253 means that the left branch of node 4 (to leaf 1) is of length 0.09787 substitutions per site, and the right branch (to leaf 3) is of length 0.13253. Phylogenetic trees can be computed by many freely available programs (such as the PAML package.) The distribution contains a perl script (dnd2tree.pl) to convert 'Newick' format files to the tree format used for monkey. This script requires that the species occur in left to right order corresponding to the top to bottom order of the fasta file. In addition, branchlengths must be specified in 'X.X' format, (such that 0 must be written as 0.0 or 0.000), the tree be strictly bifurcating (including the root) and does not support internal node names. Monkey2.0 includes a parsing script 'nhx2tree.pl' to convert newick files to the format used for monkey. This script should handle the full specification of the extended newick format (http://www.genetics.wustl.edu/eddy/forester/NHX.html) and will produce warnings if the tree is not strictly bifurcating, or does not contain branchlengths (requirements for monkey and Rmonkey). As with dnd2tree.pl, the left to right order of species names in the tree should agree with the top to bottom order in the multi-fasta file. If the alignment contains more than 2 sequences, Monkey will by default look for a file with the same name as the multiple fasta file, except with the extension '.tree'. It is possible to specify a tree file using the -tree option. For example, typing $ ./Monkey gal4sites.matrix example/YBR018C_upstream.aln.fasta -tree all.tree uses all.tree instead of example/YBR018C_upstream.aln.tree For pairwise alignments, no tree is required and Monkey will by default calculate the distance between the two sequences. If -tree is used it should be followed by a number - monkey will take this to be the distance between the sequences (in substituions per site.) VI The .occ output file Monkey will produce a file with the same name as the multiple fasta alignment file that was given as input, except the extension will now be .occ instead of .fasta. This file is in tab delimited format. Each row in this file represents a putative instance of the motif. It is possible to specify a different filename using the -out option. For example, $ ./Monkey gal4sites.matrix example/YBR018C_upstream.aln.fasta -out file.occ will write the output to file.occ. The first 7 columns in the .occ file contain 1) the name of the fasta file and matrix file, 2) a probabilistic estimate of which strand the binding site is on, where 1 is the forward strand, 0 is the reverse, 3) the position in the alignment, 4) the position in the ungapped version of the reference sequence, 5) the segment of reference sequence, 6) the single sequence liklihood ratio score and 7) the p-value associated with that score. The next group of columns contain the sequences and single sequence liklihood ratio scores for all of the other sequences in the alignment. Finally, the last three columns contain the p-value associated with the score under the evolutionary model, the % identitiy of the subsequences in the alignment, and the % ungapped. If the percent ungapped is high, monkey will heuristically modify the alignment and calculate a p-value. If the percent gaps is large, monkey will print out '-'s instead of the sequences and p-values. VII More options -help To see all the options currently available type $ ./Monkey -help -m Monkey allows the use of several scores based on different assumptions about the evolution of the motif. The default is to use the Halpern-Bruno model. -b This specifies the background substitution model. The Default is to use the Jukes-Cantor (JC) model, but the HKY model is also available. If the HKY model is used (by adding -b HKY to the command line) the transition transversion rate ratio, or kappa, should also be specified using the -kappa option. The default is kappa=2.0 . -list If the same tree is to be used for searches of many alignment files, there is a considerable speed advantage to scanning them all at once. Monkey allows the user to provide a file containing a list of alignments. $ ./Monkey gal4sites.matrix eg.list -tree all.tree -list will scan all of the alignments listed in eg.list using the tree found in all.tree, and write the output to a single file called eg.list.occ. -scr Prints out the evolutionary score as well as the associated p-value. -ref specifies which sequence in the alignment file is to be treated as the reference. This is done by adding -ref n, where n is the position in the file of the desired reference sequence, starting with 0. -cut This option can be used to reduce the total amount of hits produced in a scan of many alignments. This is the minimum score that a binding sites must have in the reference genome before it is considered for evolutionary analysis. The default is to use a score of zero. -freq To specify a background frequency file use -freq followed by the filename This file has the following form. A 0.3 C 0.2 G 0.2 T 0.3 This means that average frequencies in the intergenic region for A, C, G and T are 30%, 20% 20% and 30% respectively. If a file is not specified monkey will calculate these frequencies from the sequences given. VIII rMonkey Rmonkey is another program to detect matches to specificity matrices is multiple sequence alignments. Differences between rMonkey and the original monkey. rMonkey uses a different heuristic to identify binding sites in alignments (see Moses et al 2006). Breifly, rather than adjusting the alignment based on the number of gaps in the cloumns that correspond to the binding site, rmonkey selects an reference match, and then choses the best overlapping matches to the matrix in each other sequence. This makes rmonkey more 'agressive' at assigning orthology to binding sites, and it can often consider non-orthologous sequences to be aligned binding sites, Unlike the original monkey, in which the 'reference' species was held constant for a given run, rmonkey includes an option (specified by adding -any to the command line) which will allow the 'reference' sequence for matches to be in any species. In addition to the p-value assusiated with the evolutionary generalization of the likelihood ratio score (as in monkey), rmonkey computes p-values associated with a conditional likelihood ratio, which assumes a match in one 'reference' species has already been observed (see Moses et al, 2006). These p-values can be take a long time and use a lot of memory to calculate; they can be disabled using the -M option. A final difference between rmonkey and monkey is that cutoffs in rmonkey are in p-values rather than likelihood scores, and p-values are returned for each sequence by default. Single sequence p-values are computed by a different method, and will differ slightly (but be more accurate) than those produced by the original monkey. Inputs The input files to rmonkey are the same as for the original monkey, but some of the options have slightly different behaviours in rmonkey. -scr produces scores for each sequence and the evolutionary log likelihood ratio instead of p-values -cut specifies a single sequence p-value cutoff for a match to be included in evolutionary analysis Outputs The rmonkey output file is very similar to that of the original monkey. rMonkey produces a tab delimted file with one row corresponding to each match to the matrix. The first 5 columns contain the names, strand, aligned and ungapped positions in the reference sequence, and the sequence for the match in the reference. Note that positions in the aligned sequences are now given as 'start'-'stop'. The next column gives the single species p-value for the match in the reference. The next sets of columns give positions, sequence and single species p-values for each of the orthologous matches. The next column gives the p-value associated with the likelihood ratio score for a conserved match. This is the p-value computed in the original monkey. If a sequence contains too many gaps, rmonkey reports a 'g' instead of the '-' character used in the original monkey. Finally, the last three columns give the conditional likelihood ratio (T statistic, see Moses et al. 2006), and associated p-values under either the HB null hypothesis (a test for lack of conservation) or the HKY null hypothesis (a more conservative test for conservation). When should you use rMonkey? -If you are trying to find all the binding sites that might be conserved (and are willing to include some that are actually not conserved) -If you are trying to be conservative in identifying non-conserved bindng sites. When should you not use rMonkey? -If you are doing an evolutionary analysis of the bases in the binding sites - in rmonkey the alignment is modified in a way that depends on the matrix. -If you are interested in a region that contains many partially overlapping binding sites, and want to preserve the orthology relationships between them.