Guide to DNAWorks

DNAWorks is a computer program that automates the design of oligonucleotides for gene synthesis by PCR-based gene assembly. The program requires simple input information: an amino acid sequence of the target protein or a DNA sequence, and a desired annealing temperature. Additionally, codons can be optimized for an organism of choice, sequences (such as restriction sites) can be excluded from the protein coding region, and flanking sequences (for subsequent cloning or integration) can be added to the protein coding region. The program outputs a set of oligonucleotide sequences with highly homogeneous annealing temperatures, minimal size, and low tendencies for hairpin formation and mispriming by both short and long range repeats. With the help of DNAWorks and a two-step PCR method, synthetic genes of up to 3000 basepairs can be successfully constructed.

A full description of DNAWorks and the method of PCR-based gene synthesis can be found in our publication (Hoover & Lubkowski, 2002).

How does the program work?

At the beginning of the program, an initial gene is constructed by reverse translating the protein sequence and joining the 5' and 3' flanks to the reverse translation. Codons are chosen randomly from the set of codons that are above the codon frequency threshold value. If a threshold value of 100 is given, all codons are included in the set, regardless of frequency. This generates a single DNA sequence. In the case of an input DNA sequence, no codons are required, and the sequence is taken as is.
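
The codon selection described above can be sketched in a few lines of Python. This is only an illustration, not the DNAWorks source: the function names `allowed_codons` and `reverse_translate` and the toy table `CODON_FRACTIONS` are hypothetical, and the rule that at least the top two codons are always kept is taken from the threshold description later in this guide.

```python
import random

# Toy codon table: amino acid -> {codon: fraction used}. The Gly, Glu and Asp
# fractions are copied from the example table in this guide; Met has a single
# codon. A real table would cover every amino acid and the stop signal.
CODON_FRACTIONS = {
    "G": {"GGG": 0.16, "GGA": 0.13, "GGT": 0.35, "GGC": 0.36},
    "E": {"GAG": 0.33, "GAA": 0.67},
    "D": {"GAT": 0.63, "GAC": 0.37},
    "M": {"ATG": 1.00},
}

def allowed_codons(fractions, threshold):
    """Return the codons at or above the percent threshold, most frequent first.

    Assumptions: a threshold of 100 admits every codon, and at least the top
    two codons are always kept (or one, if the residue has only one codon).
    """
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    if threshold >= 100:
        return ranked
    passing = [c for c in ranked if fractions[c] * 100 >= threshold]
    return passing if len(passing) >= 2 else ranked[:2]

def reverse_translate(protein, threshold, flank5="", flank3="", rng=random):
    """Build an initial gene: one randomly chosen allowed codon per residue,
    with the 5' and 3' flanks joined to the reverse translation."""
    codons = [rng.choice(allowed_codons(CODON_FRACTIONS[aa], threshold))
              for aa in protein]
    return flank5 + "".join(codons) + flank3

# Example: a four-residue peptide with short, made-up flanks.
gene = reverse_translate("MGED", threshold=20, flank5="CAT", flank3="TAA",
                         rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when comparing parameter sets.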

Once a DNA sequence is obtained, the gene is broken into overlaps and oligos. This process of overlap generation is described in our paper, with the following exception. In the original version of DNAWorks, all the overlaps were contiguous so that the oligos are as small as possible. In the current version of DNAWorks, gaps between overlaps are allowed, giving larger oligos but also reducing the cost of oligo synthesis (fewer oligos needed) and simplifying subsequent site-directed mutations (if desired). Because the first and last overlaps can shift (giving a 5' or 3' overhang of 0 - n, with n being the length of the first or last overlap in nucleotides, respectively), there are multiple sets of overlaps, and hence multiple sets of oligos possible. All possible sets of oligos are evaluated and the best set is chosen, as determined by scoring.

An alternative mode of oligo design, termed "thermodynamically balanced inside-out", was developed for cases where problems occurred during PCR synthesis (Gao, et al., 2003). In an assembly set of oligonucleotides, the first half of the oligos are all synthesized in the sense orientation, and the other half are synthesized as reverse complements in the anti-sense orientation of the gene. This was found to improve control and reliability of gene synthesis by stepwise PCR. This mode can be toggled by clicking on the labeled box in the parameter section.

A gene is scored based on a set of features that are critical in the gene synthesis procedure. The sequence features evaluated in determination of a sequence score are melting temperature (overlap alone), hairpin formation potential (an oligo sequence vs. itself), misprime potential (an overlap sequence vs. the entire sequence), length (of the oligo sequences and the overlap sequences), repeat (the entire sequence vs. itself), GC content (the entire sequence vs. itself), AT content (the entire sequence vs. itself), codon frequency (codon alone) and the presence of restriction sites or sequence patterns (the entire sequence vs. itself). These features are discovered, evaluated, and scored based on absolute values (melting temperature, codon frequency, length), sequence identity (repeat, GC and AT content, restriction sites and sequence patterns) and empirical formulas (hairpin, misprime). The scores are applied to the sequence, such that the regions that have unwanted features (potential misprime site, GC rich, Tm outside the desired range, etc.) have the highest local score. The total score is the sum of all local scores. The melting temperature is calculated using the equations of SantaLucia and Hicks. The names of the restriction sites to choose from are entirely from New England Biolabs.

In the case of an input DNA sequence, mutations are not possible, and only the best set of oligos resulting from the overlap generation step is output. Sequences which code for proteins are degenerate, however, and an almost infinite combination of codons is possible for a single sequence. Thus the optimization of a coding sequence is run using a simulated annealing (Metropolis) protocol, which can find a global optimum without needing to evaluate all possible local optima, and in a greatly accelerated fashion.

The gene is silently mutated (codon swap) at a single position. The position is chosen partly on the basis of its local score (low codon frequency, GC-rich region, repeat, etc.) and partly at random. Then the total score is calculated for the gene. This is a mutation round. If the mutation lowers the score, or if the "temperature" (in simulated annealing) is high enough, the program keeps the mutation. Otherwise it reverts to the original codon. In every mutation round another single silent mutation is generated and evaluated. When a set number of mutation rounds (arbitrarily set to 6000) has gone by without a drop in the total score, the program exits, and the final set of oligonucleotides is printed.

Typically a gene will optimize very quickly (within the first 500 mutation rounds), but much smaller drops will continue for a while. Short simple sequences will drop to zero, and will exit before the 6000 rounds are up. Longer, more complicated sequences will drop in score more gradually, and tend to drag on for much longer before the arbitrary number of rounds finish. The value of 6000 for the cutoff was chosen because otherwise the program would keep churning with smaller and more insignificant drops in score for a very long time, long past the hour cutoff time.
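
The mutation rounds described above follow the general shape of a Metropolis loop. The sketch below is a toy version under stated assumptions: `SYNONYMS`, `toy_score` and `anneal` are hypothetical names, the score is a placeholder (it only penalizes long G/C runs), the position is picked uniformly rather than weighted by local score, and the cooling schedule is invented for illustration.

```python
import math
import random

# Synonymous codon choices for a three-letter toy alphabet (illustration only).
SYNONYMS = {"G": ["GGT", "GGC"], "A": ["GCT", "GCC", "GCA"], "K": ["AAA", "AAG"]}

def toy_score(codons):
    """Placeholder score: total length of all G/C runs of 6 nt or more.
    The real DNAWorks score combines many more features."""
    total, run = 0, 0
    for base in "".join(codons) + "A":   # trailing sentinel closes a final run
        if base in "GC":
            run += 1
        else:
            if run >= 6:
                total += run
            run = 0
    return total

def anneal(protein, max_stall=6000, temp=1.0, cooling=0.999, seed=0):
    """Silently mutate one codon per round; stop at score 0 or after
    max_stall consecutive rounds without a new best score."""
    rng = random.Random(seed)
    codons = [rng.choice(SYNONYMS[aa]) for aa in protein]
    score = toy_score(codons)
    best, best_score, stall = list(codons), score, 0
    while best_score > 0 and stall < max_stall:
        pos = rng.randrange(len(codons))          # uniform choice (assumption)
        old = codons[pos]
        codons[pos] = rng.choice(SYNONYMS[protein[pos]])
        delta = toy_score(codons) - score
        # Metropolis rule: keep non-worsening moves; keep a worse move only
        # with probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            score += delta
        else:
            codons[pos] = old                     # revert to the original codon
        if score < best_score:
            best, best_score, stall = list(codons), score, 0
        else:
            stall += 1
        temp = max(temp * cooling, 1e-6)
    return best, best_score

best_codons, best_score = anneal("GAGAKGAG")
```

Tracking the best solution separately from the current one means an accepted uphill move never costs the final answer anything.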

Once the final set of oligos is completed, the program outputs the results. If multiple solutions were requested (Number of Solutions), then the protein sequence is reverse translated as before, generating a new set of oligos, and the process is repeated for each solution. The results are printed to a plain text file that can be emailed to the user, or accessed via the web.

What's new in version 2.4?

Mutant sequence evaluation/entry:

After spending so much time getting a synthetic gene put together, wouldn't it be simple to make 1-3 new oligos for each site-directed mutation and be sure the new oligos will not create problems in the PCR? Well, now you can! Clicking on "mutant sequence" will display the entry form for doing just that. Enter a job name, the mutated sequence (make sure it is the same length as the original sequence), the original logfile and trial number (used for original gene synthesis). The parameters will be set to the same as that of the trial number from the original logfile. Once everything is entered, clicking "Design oligos" will generate the replacement oligos, along with an evaluation of scores for the mutated sequence. The mutation is printed in lowercase font, and it is highlighted in the oligonucleotide assembly.

When creating mutants, look through the new logfile and make sure there are no radical changes in the scores and, most importantly, the Tm histogram. And always make sure that the mutation you designed is what you expected!

What's new in version 2.3?

GC content scoring:

Stretches of high GC content can create deleterious secondary structures, inhibiting the PCR. The latest version monitors for stretches of GC-rich regions of 8 nucleotides or longer and attempts to eliminate them through codon variability.
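
As a rough illustration of this kind of check, the following sketch reports uninterrupted runs of G and C. It assumes that a "GC-rich stretch" means a run made up only of G and C; the guide does not spell out the exact definition, and the function name `gc_rich_stretches` is hypothetical.

```python
import re

def gc_rich_stretches(seq, min_len=8):
    """Return (start, end, stretch) for every uninterrupted run of G/C that
    is at least min_len long. Positions are 0-based and end is exclusive."""
    pattern = re.compile(r"[GC]{%d,}" % min_len)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq.upper())]

# One run of nine G/C bases is reported; the shorter GGCC run is ignored.
hits = gc_rich_stretches("ATGCGCGCGCCATGGCC")
```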

Length scoring:

While the melting temperatures of the overlaps and the lengths of the oligonucleotides are directly correlated, the lengths of the oligonucleotides can be modulated to some degree by codon variability. The program will attempt to keep the oligonucleotides from becoming longer than the desired length.

The length score will dominate the solution score in cases where a high Tm and a low oligo length are combined. Thus it would be desirable to enable automatic ranging to find the right balance of length and Tm for a particular amino acid sequence (see below).

What's new in version 2.2?

Codon score disabling:

The codon frequency threshold value will restrict codon usage to those codons which have frequencies equal to or greater than the percent threshold value. However, the top two codons will be used in order to allow some mutational variability. Thus, there is typically no need to maintain a score for codon usage, since the codons are automatically restricted to the highest frequencies.

Disabling the codon score allows for much faster convergence, as well as focusing the optimization on mispriming and repeat minimization. For those users who would still prefer to enable the codon scoring, setting the codon frequency threshold to 100 will turn codon scoring back on.

What's new in version 2.1?

Mispriming analysis:

Mispriming occurs when an oligo binds to an unexpected region of the DNA stably enough to allow the polymerase to initiate and extend a new strand. This is likely the biggest reason for gene synthesis failure, as it will lead to alternate and disruptive side products (the long smear, rather than the expected bands on a gel).

To counter this, DNAWorks compares the overlap sequences to the rest of the DNA, and attempts to screen out any sequence which has at least 55% identity with the overlap sequence and that has five consecutive nucleotide matches at the 3' end of the oligonucleotide. The number of potential misprimes is displayed in the real time output ("Mis = #"; the number of repeats is also displayed, "Rep = #"). Any oligo:DNA potential mispriming sites that cannot be screened out are displayed in the final output.
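
The two criteria above (overall identity and an exact 3' match) can be illustrated with a simple window scan. This is a simplified sketch, not the DNAWorks algorithm: the name `potential_misprimes` is hypothetical, both sequences are written 5' to 3' on the same strand, only one strand is scanned, and the intended annealing site will itself be reported with identity 1.0, so a caller would need to exclude it.

```python
def potential_misprimes(oligo, template, min_identity=0.55, tail=5):
    """Return (start, identity) for each template window that matches the
    last `tail` bases of the oligo exactly and has overall identity of at
    least min_identity. Start positions are 0-based."""
    oligo, template = oligo.upper(), template.upper()
    n = len(oligo)
    hits = []
    for start in range(len(template) - n + 1):
        window = template[start:start + n]
        if window[-tail:] != oligo[-tail:]:
            continue                      # the 3' end must match exactly
        identity = sum(a == b for a, b in zip(oligo, window)) / n
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# The exact site (position 4) and a diverged copy (position 18) are both found.
sites = potential_misprimes("ACGTACGTAC", "TTTTACGTACGTACGGGGAAGTTCGTACCC")
```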

Mispriming can be minimized by keeping the melting temperatures of the oligos as high as possible, and by keeping the oligos as long as possible. Unfortunately, doing so will also increase the possibility of introducing errors from oligonucleotide synthesis. Gene synthesis is very much a balancing act.

What's new in version 2?

Gapped oligos:

The original version of DNAWorks was restricted to oligos being immediately adjacent to each other. This was primarily due to my belief that the oligos should be as short as possible. However, increases in the efficiency of oligonucleotide synthesis and user demands warranted the ability to gap oligos. In version 2, oligos can be as long as the user wants, but no smaller than 20 nt.

More informative output:

The output of each trial now has a diagram of the oligos as they would join together to form the synthetic gene. Arrows display the direction of polymerization, and lines demarcate 10 nt intervals. As in the old version, oligos are alternately displayed in upper and lower case. The translated protein sequence appears below the assembled sequence. The oligos are now numbered from 5' to 3'. This simplifies finding oligos to generate fragments of the synthetic gene.

Here is a sample output with a repeat present:

 The oligonucleotide assembly is:
 ----------------------------------------------------------------
     1       10        20        30        40        50        60
     |        |         |         |         |         |         |

     1 --->                          3 --->
   1 ATGGCGCATCATCACCACCATCATGCC     cgttggcccggaacgccgcctgctggcc
      ACCGCGTAGTAGTGGTGGTAGTACGGGCACGGCAACCGGGCCTTGCGGCG     ccgg
                                                 <---  2
      M  A  H  H  H  H  H  H  A  R  A  V  G  P  E  R  R  L  L  A

     |        |         |         |         |         |         |

                           5 --->
  61 gtgtatacgggcggtaccattgGTATGCGCTCTGAGTTAGGCGTTCTGGTGCCAGGCACC
                       ***********                                < repeat?
     cacatatgcccgccatggtaaccatacgcgagactcaatccgcaagaccacg TCCGTGG
                                                  <---  4
      V  Y  T  G  G  T  I  G  M  R  S  E  L  G  V  L  V  P  G  T

 The oligonucleotide assembly is:
 ----------------------------------------------------------------
     1       10        20        30        40        50        60
     |        |         |         |         |         |         |
 
     1 --->             3 --->                                   
   1 GGGGCTACAGTAGATCGCGtagcgatagctctaaaagtttttggccgttgtgagctggcg
     CCCCGATGTCATCTAGCGCATCGCTATCGAGATTTTCAAAAACCGgcaacactcgaccgc
                                           <---  2               
                                                                 
 
     |        |         |         |         |         |         |
 
       5 --->                                        7 --->      
  61 g CGCCATGAAACGTCATGGTTTAGACAATTACCGCGGTTATAGCC  ggcaactgggtt
         ******    ******                                         < hairpin?
     cggcggtactttgcagtaccaaatcTGTTAATGGCGCCAATATCGGACCCGTTGACCCAA
                       <---  4                                 <-

Internal repeats and hairpins are now highlighted within the sequence for user inspection. Generally the hairpins that are formed are thermodynamically weak, but repeats that cannot be eliminated can lead to mispriming and synthesis failures. The repeats shown in the output are only those that occur at 3' ends of the oligonucleotides, and so are likely the most dangerous. Repeats that occur due to amino acid motif repetition are very difficult to deal with, and should be monitored closely.

Automatic ranging:

In the old version, a user had to submit many runs to test various lengths and annealing temperatures. In version 2, entering the value "50-55" in the oligo length textbox allows automatic ranging of length from 50 to 55 nt. The user can then look to the final summary to decide which length and annealing temperature gave the best results.

Faster optimization:

Optimization of the synthetic sequence is more efficient and faster in version 2. This is because scoring is now done on codons, rather than overlaps. Thus the program does not waste time "guessing" which codon is most troublesome to the sequence score. This also makes multiple solutions generally unnecessary, as identical parameters now result in very similar solutions.

PCR condition parameters:

The algorithm for determining annealing temperatures is expanded to include factors for oligonucleotide, monovalent cation, and magnesium concentrations. These factors can change the annealing temperatures of oligonucleotides dramatically. The user can thus anticipate the effects of the final PCR conditions instead of guessing.

The equations and values for determining annealing temperatures are from SantaLucia & Hicks, 2004.
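
To show why oligonucleotide concentration matters, here is a sketch of the standard two-state melting temperature relation for a non-self-complementary duplex, Tm = dH / (dS + R ln(C/4)). It is not taken from the DNAWorks source: the function name `two_state_tm` is hypothetical, and the enthalpy and entropy are assumed to be supplied already summed from nearest-neighbor parameters and corrected for salt, since the exact corrections used by the program are not given here.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)

def two_state_tm(dh_kcal, ds_cal, oligo_nM):
    """Melting temperature in degrees Celsius for a two-state duplex.

    dh_kcal  : duplex enthalpy in kcal/mol (negative for duplex formation)
    ds_cal   : duplex entropy in cal/(K*mol), assumed already salt-corrected
    oligo_nM : total strand concentration in nanomolar, with both strands
               assumed to be present at equal concentration
    """
    c_total = oligo_nM * 1e-9
    return dh_kcal * 1000.0 / (ds_cal + R * math.log(c_total / 4.0)) - 273.15

# Illustrative (made-up) thermodynamic values for one overlap.
tm_low = two_state_tm(-150.0, -400.0, oligo_nM=100.0)
tm_high = two_state_tm(-150.0, -400.0, oligo_nM=1000.0)
```

With the same duplex, the higher strand concentration gives the higher melting temperature, which is the effect the oligonucleotide concentration parameter accounts for.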

Thermodynamically balanced inside-out mode output:

The method of gene synthesis employed by DNAWorks is termed "thermodynamically balanced", in that all the oligonucleotides should assemble and anneal at the same temperature. The amplification occurs everywhere at once, and ideally can generate the gene with just one round of PCR. However, there are sticky cases where the gene does not amplify, and constructing the gene in pieces is not successful.

A more controlled method of gene synthesis, termed "thermodynamically balanced inside-out", was developed for cases where problems occurred during PCR synthesis (Gao, et al., 2003). In an assembly set of oligonucleotides, the first half of the oligos are all synthesized in the sense orientation, and the other half are synthesized as reverse complements in the anti-sense orientation of the gene. The gene assembly and amplification are thus done in steps of 0.4-0.6 kb from the center pair of oligonucleotides outward.

The new version of DNAWorks allows for the conventional mode output or inside-out mode output. This simplifies the synthesis of oligonucleotides for gene synthesis.
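
The orientation rule for inside-out mode can be sketched as follows. The function names are hypothetical, the input is assumed to be the oligo regions written on the sense strand in 5' to 3' gene order, and placing the extra oligo of an odd-sized set in the anti-sense half is an arbitrary choice made for this illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement, preserving upper/lower case."""
    return seq.translate(COMPLEMENT)[::-1]

def inside_out_orientation(segments):
    """Keep the first half of the oligos as sense sequences and convert the
    second half to reverse complements (anti-sense orientation)."""
    half = len(segments) // 2
    return segments[:half] + [reverse_complement(s) for s in segments[half:]]

# Four short, made-up sense-strand segments.
oligos = inside_out_orientation(["ATGGCC", "GCCAAA", "AAACGT", "CGTGGG"])
```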

Why Gene Synthesis?

In the post-genomic era, thousands of unknown proteins have become available for study. While in theory the structures and functions of many of these proteins may be determined by comparative analysis (Bork et al., 1998), in most cases, overexpression and purification of target proteins will be necessary (Baxter & Fetrow, 2001; Gerlt & Babbitt, 2000). Although the use of naturally occurring genes might appear to be the quickest approach, many such genes will prove to be suboptimal for cloning and overexpression in heterologous systems like Escherichia coli or yeast. The potential problems include high G+C content, codon bias and complex intron/exon structures. An approach to overcoming the complications in cloning is gene synthesis. In this approach, the protein coding sequence can be directly optimized for the expression system of choice. Variants of this strategy include oligonucleotide ligation (Heyneker et al., 1976), the FokI method (Mandecki & Bolling, 1988) and self-priming PCR (Dillon & Rosen, 1990). A particularly appealing method, due to its inherent simplicity, is assembly PCR (Stemmer et al., 1995). This involves generating overlapping oligonucleotides which, when assembled, form the template for the gene of interest. The oligonucleotides are then repetitively extended by PCR, to assemble the full-length gene in a single step.

While this method is simple in principle, in practice numerous complications can lead to errors in the synthesis. To reduce the possibility of errors during oligonucleotide synthesis, the oligonucleotides should be rather short, yet they must still be long enough to provide stable priming overlaps. Any deleterious secondary structures in the oligonucleotides and gene also need to be avoided. Further, the presence of internal repeats within the sequence can cause mispriming, and any overlooked sequences (such as restriction sites or integration-specific sequences) can cause downstream difficulties with subcloning. Therefore, for large proteins with coding sequences of >300 nt, the process of designing these oligonucleotides is tedious and confusing. In the case of a single gene, the problem can be attacked by manual design, but for projects where high throughput is required (e.g. structural genomics) an automated strategy for synthetic gene design is needed.

In practice, the cost of creating synthetic genes is economically competitive with cloning a gene from a cDNA library. At approximately 35-50 cents per base, a gene encoding a 200 amino acid protein and 25 nucleotide flanking regions would cost about $400 and 3-5 days working time to synthesize. Added to the benefits are predesigned codon optimization and elimination/addition of restriction sites and promoter regions. These factors tip the scale in favor of synthetic genes.

Protein Mode

Protein Sequence

Enter Protein Sequence:

Protein sequences can be entered in one of the following formats: ASN.1, EMBL, FASTA, GCG, GenBank, plain, raw, or SwissProt. All header, comment, and footer information will be discarded, and any characters other than those in the single amino acid code will be removed.

     A = alanine         I = isoleucine       R = arginine
     C = cysteine        K = lysine           S = serine
     D = aspartate       L = leucine          T = threonine
     E = glutamate       M = methionine       V = valine
     F = phenylalanine   N = asparagine       W = tryptophan  
     G = glycine         P = proline          X = stop 
     H = histidine       Q = glutamine        Y = tyrosine

Upload Protein Sequence File:

The protein sequence can be uploaded from a local disk to the server. The formatting rules for manually entered sequences also apply to uploaded sequences.

Codon Frequencies

Choose a standard organism:

Several commonly used organisms are already entered in a list box for simple access to the program. The codon frequencies for these organisms are based on the number of times each codon is found in protein coding regions of the respective organism's genome. In the case of E. coli, however, the frequencies are for genes that are expressed at high levels during exponential growth, as determined by factorial correspondence analysis (Medigue et al., 1991).

Enter Codon Frequencies:

DNAWorks requires the GCG format of codon frequencies. The codon frequency table should be input as five columns as shown below. The data represent the residue in three letter code, the codon triplet (in DNA, not RNA), the number of codons in the dataset, frequency per thousand, and the fraction used. It is not necessary to align the column fields, as long as at least a single space separates the fields.

      Gly     GGG     40359.00     11.39      0.16
      Gly     GGA     34894.00      9.85      0.13
      Gly     GGT     89915.00     25.37      0.35
      Gly     GGC     94608.00     26.70      0.36
      Glu     GAG     66665.00     18.81      0.33
      Glu     GAA    137748.00     38.87      0.67
      Asp     GAT    116164.00     32.78      0.63
      Asp     GAC     67865.00     19.15      0.37
      Val     GTG     85263.00     24.06      0.34

      etc...

     Ala = alanine         Ile = isoleucine       Arg = arginine
     Cys = cysteine        Lys = lysine           Ser = serine
     Asp = aspartate       Leu = leucine          Thr = threonine
     Glu = glutamate       Met = methionine       Val = valine
     Phe = phenylalanine   Asn = asparagine       Trp = tryptophan  
     Gly = glycine         Pro = proline          End = stop 
     His = histidine       Gln = glutamine        Tyr = tyrosine

Upload Codon Frequency Table File:

A file containing codon frequencies can be uploaded rather than manually entered. All format rules for manually entered tables (above) also apply for uploaded files.
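
For readers preparing their own tables, the five-column layout described above is easy to check before submission. The sketch below is a hypothetical helper, not part of DNAWorks: it splits each line on whitespace, skips anything that does not have five fields with three numeric columns, and keeps only the fraction column.

```python
def parse_gcg_codon_table(text):
    """Parse lines of: residue, codon, count, per-thousand, fraction.
    Returns {residue: {codon: fraction}}. Lines that do not fit the
    five-column layout (blank lines, 'etc...', headers) are skipped."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        residue, codon = fields[0], fields[1].upper()
        try:
            count, per_thousand, fraction = (float(x) for x in fields[2:])
        except ValueError:
            continue                      # non-numeric data: not a table row
        table.setdefault(residue, {})[codon] = fraction
    return table

# Rows copied from the example table in this guide.
SAMPLE = """
      Gly     GGG     40359.00     11.39      0.16
      Gly     GGC     94608.00     26.70      0.36
      Glu     GAA    137748.00     38.87      0.67

      etc...
"""
codon_table = parse_gcg_codon_table(SAMPLE)
```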

Flanking Sequences

Enter 5'-Flanking Sequence:

Enter 3'-Flanking Sequence:

The flanking sequences can be entered in any of the same formats as the protein sequences. Again, all header, comment, and footer information is removed from the input sequence.

Restriction Site / Custom Site Screen

      K = G or T         M = A or C
      R = A or G         Y = C or T
      W = A or T         S = C or G
      B = C or G or T    V = A or C or G
      D = A or G or T    H = A or C or T
      N = A or C or G or T

Enter Restriction Sites:

There are 180 restriction sites available in list box format, partitioned into non-degenerate and degenerate sequences. Multiple restriction sites can be entered. The restriction sites are limited to 5 nucleotides or longer, and all are available from New England Biolabs.

Enter name and sequence:

The custom site screen allows unlisted restriction sites and novel sequences to be excluded from the protein coding region. The format for each site is a name for the site followed by the sequence:

                 Site_Name_1
                 ATGCAT
                 Site_Name_2
                 CCANNBNNGGT
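
To preview where a custom site would match, the degenerate codes listed above can be expanded into a regular expression. This is a minimal sketch with hypothetical function names. It searches only the strand as given; a non-palindromic site would also need to be checked against the reverse complement.

```python
import re

# Degenerate nucleotide codes, as listed in this section.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "M": "[AC]", "R": "[AG]", "Y": "[CT]",
    "W": "[AT]", "S": "[CG]", "B": "[CGT]", "V": "[ACG]",
    "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}

def parse_custom_sites(text):
    """Pair up alternating name and sequence lines into {name: sequence}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return dict(zip(lines[0::2], lines[1::2]))

def find_site(site, seq):
    """Return 0-based start positions of every match, overlaps included."""
    body = "".join(IUPAC[base] for base in site.upper())
    pattern = re.compile("(?=%s)" % body)   # lookahead allows overlapping hits
    return [m.start() for m in pattern.finditer(seq.upper())]

sites = parse_custom_sites("""
    Site_Name_1
    ATGCAT
    Site_Name_2
    CCANNBNNGGT
""")
positions = find_site(sites["Site_Name_2"], "AACCATTCAAGGTAA")
```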

Parameters

Oligo Length:

The oligo length parameter provides a limit to the length in nucleotides any one of a set of synthetic oligonucleotides can attain. The synthesis of oligonucleotides is subject to errors, mainly deletions, but occasionally mismatches and insertions. The error rate of oligonucleotide synthesis is primarily dependent on length; longer oligonucleotides tend to have more errors (although operator methodology can play a strong role as well -- DNAWorks cannot address technical sloppiness!). To minimize the number of errors in synthetic genes, it is best to keep the oligonucleotide lengths to a minimum.

The oligo length value is directly correlated to the annealing temperature; higher annealing temperatures will result in longer oligonucleotides. Also, to maintain high affinity between oligonucleotides, the oligonucleotides must be long enough to provide decent overlap. Thus, although most program executions with reasonable parameters will result in a set of oligonucleotides whose lengths match or are below this value, the attainment of the desired length is not guaranteed.

Entering "50-55" will automatically generate solutions for oligo lengths in the range 50-55 nt. Thus, multiple parameter sets can be tested at the same time.

Annealing Temperature: Suggested values: 55-75

The annealing temperature parameter sets an ideal annealing temperature for a set of synthetic oligonucleotides. At this temperature, under normal PCR conditions (ionic strength ~100 mM, [Mg2+] = 1-4 mM), all of the oligonucleotides will anneal and assemble cooperatively. The uniformity of annealing temperatures prevents mispriming and/or lack of priming prior to the elongation step, and helps to assure a single uniform PCR product.

As explained above, the annealing temperature and oligonucleotide length are directly correlated. DNAWorks will, however, favor the annealing temperature above the maximum oligo length. Thus, the range of annealing temperatures in a set of synthetic oligonucleotides will always remain much smaller than the range in oligonucleotide lengths, and while oligonucleotide lengths may exceed the input maximum oligo length parameter, the annealing temperatures will always be kept close to the input value.

Entering "60-65" will automatically generate solutions for annealing temperatures in the range 60-65 °C. Thus, multiple parameter sets can be tested at the same time.

Codon Frequency Threshold: Suggested values: 0-100

The level of protein expression depends to some degree on the availability of tRNAs to the growing polypeptide chain on the ribosome. Codons used infrequently often have low levels of their cognate tRNAs. This phenomenon of "codon bias" has been shown to be the case for Escherichia coli expression. Thus, by using only the most frequent codons, the availability of tRNAs ceases to be an issue in protein expression levels.

Further, the high G+C content of the genomes of certain organisms creates problems in cloning genes from these organisms. By optimizing codon bias using equally mixed A+T/G+C codons, the problems involved in cloning are completely avoided.

The codon frequency threshold parameter sets a cutoff for which codons to be used for reverse translation of protein sequences into DNA. For example, a value of 20 will allow only those codons whose frequencies equal or exceed 20% to be used in reverse translation and optimization. However, DNAWorks will always include at least two codons during optimization. Thus, a value of 50 will only allow the top two codons for each amino acid to be used.

Because the set of codons used in optimization is absolutely restricted, there is no need to generate a score for codon frequency. This speeds up the program, and indirectly increases the weight of the other parameters. If the Codon Frequency Threshold is set to 100, then all codons will be used, and the codon frequency score is enabled. This was introduced in version 2.2.

Number Of Solutions: Suggested values: 1-100

DNAWorks uses a random number generator and simulated annealing to optimize multiple parameters simultaneously during execution. Thus, multiple runs will generate solutions of varying levels of success. While smaller genes (proteins of less than 100 amino acids) may not require more than one run to generate a satisfactory set of oligonucleotides, longer genes benefit from multiple runs. The number of solutions parameter will set the number of oligonucleotide sets generated during execution.

Oligonucleotide Concentration: Suggested values: 100 (nM)

The concentration of oligonucleotides in the PCR reaction affects their annealing temperatures. The user can enter the value (in nanomolar) of the oligo concentration for the PCR reaction.

Monovalent Cation Concentration: Suggested values: 50 (mM)

The concentration of monovalent cations (Na+, K+) in the PCR reaction can affect annealing temperatures. The user can enter the value (in millimolar) of the total cation concentration for the PCR reaction.

Magnesium Concentration: Suggested values: 2 (mM)

The concentration of magnesium in the PCR reaction has a profound effect on annealing temperatures. The user can enter the value (in millimolar) of the Mg2+ concentration for the PCR reaction.

Thermodynamically Balanced Inside-Out Mode Output: Default: OFF

Activates thermodynamically balanced inside-out mode output for a more controlled PCR gene synthesis.

E-mail Address

Enter E-mail Address:

Because of internet traffic and congestion on the local server, some jobs may take from several minutes to an hour to complete. This is certainly the case when multiple runs are required. In these situations, the inclusion of an e-mail address allows the user to "fire and forget". The results of the job will be sent upon completion of the job. However, the inclusion of an e-mail address is entirely voluntary.

DNA Only Mode

DNAWorks helps to design oligos for synthesizing not only protein coding genes, but non-coding genes as well. In this case, breaking the sequence up into oligonucleotides of equal annealing temperatures is sufficient. The program will output the same histograms as for protein coding regions, but it does not optimize the sequence for PCR-based gene synthesis. Therefore, if there are any repeats or secondary structures, it's up to the user to do something about them.

DNA Sequence

Enter DNA sequence:

Upload DNA sequence file:

The DNA sequence can be in any of the same formats as protein sequence (see above). Only A, C, G, or T will be allowed:

      A = adenosine       G = guanosine
      C = cytidine        T = thymidine

Annealing Temperature: Suggested values: 55-70

See above for details.

Because no optimization is done for DNA only mode, only a single run of the program is necessary, and the job is typically finished instantaneously.

E-mail Address

Enter E-mail Address:

Running DNAWorks in DNA only mode typically gives results instantaneously, so there is usually no need to wait. This feature is included only for consistency.

Some Notes on PCR-Based Gene Synthesis

The method of generating synthetic genes proceeds through four steps:

1. Designing and synthesizing oligonucleotides
2. Gene assembly by PCR
3. Gene amplification by PCR
4. Subcloning the PCR product

Designing and Synthesizing Oligonucleotides:

Use DNAWorks to generate oligonucleotide sequences. Make very sure that everything is what it should be; i.e., no unforeseen restriction sites within the gene, sequences are correct, flanking sequences are proper, etc. Then order the oligos you need.

Gene Assembly:

Dissolve primers in appropriate amounts of water to 0.3 mM. Mix 20 ul of each into a single tube. Dilute the oligos to a final concentration of 5 uM each. Mix the oligos into the PCR mixture (buffer, Mg2+, polymerase, etc.) to a final concentration of 0.2 uM. Run standard PCR with the annealing temperature entered into DNAWorks. The elongation time will depend on the length of the gene and the polymerase used. For Pfu Ultra (Stratagene), 30 seconds/1 kb is sufficient. Generally 15 cycles should give enough template for the next step.

This will probably result in a smear on agarose electrophoresis. Not to worry, an additional step will pull out the proper fragment.

For details on the method of thermodynamically balanced inside-out gene synthesis, please see (Gao, et al., 2003).
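
The dilution arithmetic in the Gene Assembly step can be worked through as follows. The concentrations come from the protocol above (0.3 mM stocks, 20 ul of each, 5 uM each in the diluted mix, 0.2 uM final), while the function name and the 50 ul reaction volume are assumptions made for this example.

```python
def assembly_dilutions(n_oligos, stock_uM=300.0, vol_each_ul=20.0,
                       mix_target_uM=5.0, pcr_target_uM=0.2,
                       pcr_volume_ul=50.0):
    """Work out how much water to add to the pooled oligos and how much of
    the diluted mix goes into one PCR (pcr_volume_ul is an assumed size)."""
    pooled_ul = n_oligos * vol_each_ul
    pooled_uM = stock_uM / n_oligos        # pooling dilutes each oligo n-fold
    if pooled_uM < mix_target_uM:
        raise ValueError("pooled mix is already below the target concentration")
    final_mix_ul = pooled_ul * pooled_uM / mix_target_uM
    return {
        "pooled_uM_each": pooled_uM,
        "water_to_add_ul": final_mix_ul - pooled_ul,
        "mix_per_pcr_ul": pcr_volume_ul * pcr_target_uM / mix_target_uM,
    }

# Twenty oligos: each is at 15 uM after pooling, so the pool is diluted 3-fold.
plan = assembly_dilutions(20)
```

With more than 60 oligos, pooling alone brings each oligo below 5 uM at these volumes, which is why the sketch raises an error rather than returning a negative water volume.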

Gene Amplification:

Dilute outer primers (those primers that match the 5' ends of the PCR product) to 5 uM each. Run standard PCR with the outer oligos at 0.1 to 0.4 uM final concentration, using 1 ul of the initial PCR (the assembly PCR) as template. Again, 15 cycles should be enough to see product on a gel. After PCR, there should be a single band by agarose electrophoresis.

Once PCR is done, purify PCR fragment:

1. Add agarose gel running buffer and run PCR reaction out on 1% agarose gel
2. After staining gel with EtBr, remove correct sized band and place into sterile Eppendorf tube
3. Purify gene from gel

Subcloning:

Cloning of the gene can be done by ligation, topoisomerase (TOPO system), or transposases (Gateway system). See the specific protocols for these methods. If ligation will be used, it may be useful to scale up the Gene Amplification step. I would recommend four tubes of 100 ul each (400 ul total PCR product) to be very safe.

Crucial Parameters

- It may help to run "touchdown" and "hot start" PCR to minimize mispriming during gene assembly and amplification. This should help in producing a single band.

- Optimizing the PCR conditions may also help in minimizing (or eliminating) errors introduced during PCR. We have seen an error rate of 2 errors per 1 kb sequenced. Generally, one should not need to sequence more than 3 clones to find one with the correct sequence. In brief tests, keeping oligonucleotide lengths below 45 nucleotides, optimizing PCR conditions, and using Pfu Ultra completely eliminated PCR-introduced errors.

- Low annealing temperatures (55-58 °C) are not as critical as the length of the overlap. Make sure that none of the overlaps drop below 12 nucleotides. While the annealing temperature for that overlap may be above 58 °C (such as a high G+C sequence), a short overlap may allow mispriming.

Good Luck!

Comments/suggestions/errors to David Hoover
Center for Information Technology, National Institutes of Health
Last modified January 4, 2005