Douglas R. Hofstadfer, Professor of Cognitive Science and Computer Science Adjunct Professor of Psychology, Philosophy, and History and Philosophy of Science Center for Research on Concepts and Cognition Indiana University o 510 North Fess Street Bloomington, Indiana 47408 (812) 855-6965 September 24,1990 Dr. Paul Berg Beckman Institute, Room BO62 Stanford University Medical Center Stanford University 94305 Dr. Maxine Singer Carnegie Institution 1530 P Street N.W. Washington D. C. 20005 Dear Paul and Maxine, I really should know myself better ... I should know that when I decide to get involved in reading a book manuscript, I will wind up spending huge amounts of time on it, putting red all over the place, getting involved on every level, from minute typographical things to global organization. In any case, I had a great time, but now - at long last - I'm done with it! First of all, I want to apologize to you, both for the amount of red (or black, depending on which copy you get) and for the sometimes rather impatient or annoyed tone of my remarks in the margins. I seem to have a very opinionated streak - I like things to be said in certain ways, and I think certain phrases sound wrong, and so on and so forth. I hope you'll take the tone of my remarks with a grain of salt - I have the greatest of admiration for what you've done here, and have enjoyed reading it very much. Needless to say, I've also learned a lot (though I certainly haven't yet absorbed all of it). As you will see, I am a fanatic about clarity in writing - not only clarity of imagery, but also syntactic non-ambiguity. This means that I often suggest rewordings of passages that may look completely fine to an author who already knows what is intended; however, that same passage may have been hard to parse for me, an outsider, and so I have tried to eliminate that alternate parsing. A perfect example of this is on page 6'7, where you write, "In practice, visualizing a pattern of restriction nuclease fragments requires separating the mixture of fragments according to length." First there was "in practice"; was this meant as the opposite of "in theory"? That's what it means to me. However, given the context, I decided that what you probably meant was "in the laboratory". Next came the term "visualizing", which to me has just one meaning - "producing visual imagery" - and it was obvious you didn't mean that. I decided you meant essentially "making [something] visible". Finally, even though I had just read the previous several sentences about how restriction nucleases chop up DNA, I was totally thrown by the phrase "restriction nuclease fragments", until I realized that what you meant was not "fragments of restriction nucleases", but "fragments produced by restriction nucleases" - and in fact, not by a bunch of restriction nucleases, but by a particular one. Putting it all together, I came up with the following rewrite: "In the laboratory, in order to actually see the pattern of DNA fragments produced by a particular restriction nuclease, one must separate the various fragments in the mixture according to their -2- lengths." It's a little longer, admittedly, but it sure makes it easier on the lay reader! Scientists who know what they mean and want to write concisely tend to use lots of "code phrases" whose meanings are completely obvious to a colleague, but which risk evoking completely unwanted meanings in the mind of a non-specialist. For example, on page 75, you use the term "eukaryotic virus". When someone who is completely self-confident about the meanings of both words first hears it or reads it, they pretty much have to come to the conclusion that it refers to viruses that attack eukaryotic organisms, and as soon as they've thus "decoded" it once, it becomes a totally comfortable, problem-free little phrase that they'll probably start using on their own. Very nice. However, to a lay person encountering it for the first time, someone whose mastery of the two terms is somewhat shaky, it will be a confusing phrase. They will think, "How can a virus be eukaryotic? It's not even a cell! How can it have a nucleus? Did I miss something crucial? Are there some very big viruses that somehow have nuclei?" And so on. All these concise little code phrases just add up. "Expression vector". "Restriction nuclease digest". "Anonymous probe". "Retrotransposon". "Viral oncogene". "Transducing phage". "Insertional mutagenesis". It's not that I'm opposed to technical terms - not at all. But there are certain terms that somehow seem to me to be completely clear (perhaps only in retrospect, now that I'm used to them), whereas others seem somehow too concise, and therefore opaque. To me, they seem to impede communication instead of aiding it. This is a subtle matter, and I'm not sure exactly what my recommendation is. Probably there's no simple recipe. I'll just let the suggested rewrites speak for themselves - maybe collectively they'll convey to you my feelings on this matter. My passionate drive for syntactic non-ambiguity has led me to become a big fan of dashes (em-dashes, that is). When they are used properly, their "feel" is just about halfway between that of commas and that of parentheses, which makes them extremely helpful for purposes of clarity. Very often, an appositive phrase set off by commas causes a good deal of potential ambiguity, whereas with dashes it would be absolutely clear. For example, on page 56 of Chapter 4, it says, "Phage, bacterial viruses, can also bring foreign DNA ..." The way it's written could lead a non-savvy reader to think that you're talking about phage and bacterial viruses, whereas with a dash it's obvious that it's an appositive phrase: "Phage - bacterial viruses - can also bring foreign DNA ..." Similarly, on page 20 of Chapter 1, you write, "Almost a century after their independent beginnings, the three separate scientific fields, chromosome behavior and structure, abstract genetic analysis, and biochemistry were unified." As it is, with commas, the part in boldface might be read as a list whose first item is "the three separate scientific fields". With dashes, it is much clearer: "Almost a century after their independent beginnings, the three separate scientific fields - chromosome behavior and structure, abstract genetic analysis, and biochemistry - were unified." Note that em-dashes are really very much like parentheses, in that they tend to go in pairs. In fact, in some European books, I have seen them treated almost exactly like parentheses. For example, the previous example would be typeset as follows: ``Almost a century after their independent beginnings, the three separate scientific fields -chromosome behavior and structure, abstract genetic analysis, and biochemistry- were unified." Note how there are spaces on just one side of each dash, telling you whether it is a Zejt dash or a right dash. While I like this convention, I wouldn't go so far as to recommend it in your book. -3- By the way, real em-dashes are not hyphens, although the dashes in your manuscript were all typed as hyphens. When I learned typing, I was taught to use a pair of hyphens flanked by blanks, although a more common convention leaves the blanks out. I'm sure your publisher will take care of such details, however. A related ambiguity-connected passion of mine concerns compound modifiers; I feel they should almost always be hyphenated. For example, on page 86 you start a paragraph with "By the late 19'7O's, it was clear that the protein coding sequences in a eukaryotic gene...". A non-specialist could easily interpret this, at least at first, as "it was clear that the protein, coding sequences in a eukaryotic gene, ..." With a hyphenated modifier, it becomes impossible to misinterpret: "By the late 19'7O's, it was clear that the protein-coding sequences in a eukaryotic gene ..." On page 92, another paragraph starts, "For example, the maturation of an RNA polymerase I1 primary transcript of a protein coding gene into a messenger RNA requires several steps." Here you really could benefit from hyphens: "For example, the maturation of an RNA-polymerase-I1 primary transcript of a protein-coding gene in to a messenger RNA requires several steps." Even here, though, using a long compound noun ("RNA polymerase 11") as a modifier in front of "primary transcript" is quite confusing. Why not spell the whole idea out a little less concisely and a little more directly? It could go something like this: "For example, it requires several steps for the primary transcript of a protein-coding gene, transcribed by RNA polymerase 11, to mature into a messenger RNA." I have the impression, not just from your book, that usage of multiple complicated compound modifiers is rampant in molecular-biology articles. This may not confuse technical colleagues, but it can wreak havoc with less with-it readers. My favorite example of a problem with compound modifiers in your manuscript is found on page 50, where you refer to "a ribosome-transfer RNA-mediated process". When I first hit this, I scratched my head and wondered, "What ribosome-transfer process? I don't recall any such thing. What do they mean? And although I admit the process they were just talking about could be called `RNA-mediated', I wouldn't have put it that way ..." Then all of a sudden it hit me that you were referring to transfer RNA, and that what you meant was essentially "a ribosome-and-tRNA-mediated process". However, a far better way to phrase it would be "a process mediated by ribosomes and transfer RNA". The blank space between "transfer" and "RNA" in your original phrase, however, threw me totally off, making me perceive two hyphenated compound modifiers, even though I have read a million times about tRNA, and even lectured and written about it numerous times. By the way, while we're on the topic of RNA, I myself prefer writing "tRNA", "rRNA", and "mRNA" to "transfer RNA", "ribosomal RNA", and "messenger RNA" - they're more concise and each one of them ought to be an independent concept in the reader's mind. It's sort of like "DNA" and "RNA" - you wouldn't want to write out their full names every time. I'm not suggesting that readers not be told what the "t", "r", and "m" stand for -just that you ought to allow yourselves to use those abbreviations in spots, where it would make things a little easier. On page 1'76 you write, "The introduction of a different oncogene, my, under the control of a mouse mammary tumor virus promoter, yields mammary tumors." I found the phrase "a mouse mammary tumor virus promoter" hard to decipher. Here's a suggested rewrite: "Mouse mammary tumor virus promoter controlled alternate oncogene introduction yields mammary tumors." (Just kidding!) -4- My most outspoken comments, I think, come at the top of page 41, where, on the topic of the genetic code, you wrote the following: "The challenge was to decipher the code and to learn how it is translated from DNA to protein." Firstly, I object to the word "decipher" applied to the genetic code. "Crack the genetic code", okay - but "decipher"? No way! That verb applies to messages written in a code, not to the code itself. Thus during World War 11, the British deciphered German messages sent to the U-boats, and they did so by cracking the code in which those messages were written. But they didn't decipher the code. (I feel like William Safire, I must say!) But a worse sin was that you then went on and said, "to learn how it [the genetic code itself] is translated from DNA into protein". Shame, shame! It reminds me of when people say things like, "They discovered that genes were genetic codes for proteins." I So much for syntactic aspects. One of my more important content-related concerns had to do with the way in which, in the first few chapters, you slide into talking about DNA, genes, and chromosomes as "information". It seems to me that historically, there were roughly four distinct stages of the concept of "gene". They go something like this: cringe when I hear such things! c \. (1) a gene as an abstract entity responsible for a particular inherited trait (Mendel) (2) a gene as a physical piece of a particular chromosome (T. H. Morgan) (3) a gene as correlated, on a holistic level, with a particular protein (but no real sense of coding- Le., one-to-one sequential matchup of pieces of the (/ `Ii. '`Y -. ~ __I 'p 1- r <3, gene with pieces of the protein) (Garrod, Beadle, Tatum) Y (4) a gene as a linear chain codingfor a particular protein in a sequential, I: ( i v piece-by-piece manner .$ My feeling is that you start using words like "information" and "encode" when you have gotten readers to stage 3 but not yet to stage 4. To me, this doesn't work. When all one knows is that particular genes are correlated with particular proteins as wholes, one won't think of genes as constituting information. There's no sense of "reading" a gene until you think of genes as linear structures composed of letter-like entities, and proteins as other types of linear structures composed of some other kind of letter-like entities. Then and only then, in my opinion, does it start to make sense to talk in terms of "coding" and "information". There is a related comment on page 40 about the word "express", which I'll retype here just so you can think about it in advance. What I wrote was essentially this: "When a normal person speaks about how a message is expressed, they mean how it is put into the medium (here the medium is DNA, but in general, it is spoken or written language). By contrast, when a molecular biologist talks about the expression of a message, they mean how it is gotten out of the medium (DNA or RNA)." Thus there is a curious contrast between everyday and technical uses of the same word. The way I see it is that in biology, there are really two media - nucleotides and amino acids - and you are taking an input message in one medium and expressing it in the other medium. It's just that no one speaks of proteins as "messages"; only genes (and possibly mRNA) are described as "messages". You'll notice that I object numerous times to the word "particle" to refer to things like viruses, ribosomes, etc. It's probably just my physics background showing through, but I've never liked that word in biological contexts, because to me, "particle" -5- carries very strong connotations of indivisibility and fundamentality, as epitomized by electrons, quarks, and so on. Another minor hobbyhorse of mine is the term "catalyze", in reference to the function of enzymes. Although strictly speaking, it is the right term to use, it conjures up entirely the wrong imagery for me (and, I would suspect, for most lay people) - namely, the idea that a given enzyme's presence mildly increases the speed of some reaction (doubling or tripling it, say), when in fact the enzyme is so phenomenally catalytic that it speeds the reaction up by a factor of a million or a billion! For this reason, I prefer simply to say that an enzyme cam`es out a reaction, even if strictly speaking it's not true. Everybody speaks of enzymes as "cellular machinery", and that's what they mean! An overall comment on the diagrams: they are not caricatural enough for my taste. They also contain too many technical terms and symbols for me. What I want ,/?g,,wn.. to see is a situation stripped down to its bare essence, rendered maximally simple and maximally clear, rather than clothed in lots of details. And, incidentally, I think your captions are generally too terse - I would like to see them essentially self-sufficient, as in Scientific American articles. Finally, I come to a few topics that I'd like to see you expand on. My absolute favorite one is overlapping genes. Ever since I first read about such things in 4x174 DNA, I was fascinated. The whole idea is so weird, so much like biological puns or double entendres. There are apparently two types - the shifted-reading-frame type (as in 4Xl74), and the nonsense-strand type (Le., where the two genes face each other on complementary strands). Both types are so amazing and so virtuosic. How common is this phenomenon? Does it happen only in viruses? Does it ever happen in prokaryotes or eukaryotes? How could such tricks have evolved? A related question is the extent to which regulatory sequences (including out-of- frame start codons and stop codons) might appear by accident inside coding sequences. - Why shouldn't this happen occasionally - in fact, quite often? And how would inducers, repressors, RNA polymerases, and other related enzymes recognize such "accidents" for what they were and ignore them? A question related to this one is how mistakes in copied DNA are recognized and corrected. How in hell can a dumb little enzyme inspect an isolated piece of DNA and know that it contains an error? Or doesn't it happen that way? Are mistakes corrected right at the moment of copying, when both strands - master and copy - are available for inspection and comparison? Another question was prompted by something you wrote on page 127: "Binding [of an antigen to an immune receptor] also results in secretion of antibodies with the same variable-region binding-site [the hyphens are mine] and therefore the same antigen specificity as the receptor." I wrote in the margin that this seems to me to imply an "illegal" information flow (contrary to the Central Dogma, that is) - namely, from protein (the immune receptor) to DNA (that of the B cell) and then back out to protein (the antibody). This is probably an incorrect image that I have, but I wonder what really goes on. Can you explain this to readers in a bit more detail? I have one final question, prompted by my interest in so-called "genetic algorithms" (computer models of intelligence inspired by evolution): Is it meaningful to say which is more important as a driving force in evolution - mutation or recombination? My colleague and friend John Holland, at the University of Michigan, maintains that mutation is just a small force in evolution, and that what i, 0 I I 3 -6- really drives it forward is recombination. He has mathematical arguments that in some sense prove this claim, but I wonder if it is a generally accepted notion, or if biologists would dispute it. Well, finally I have come to the end of this letter. I hope you find my comments helpful and not depressing. Please note that I didn't suggest any global changes - almost everything is pretty local, and therefore not all that hard. I really am looking forward to the final version of the book - it ought to be great! And I'm sure that on my next read-through of it, I will absorb considerably more. It's been fun. Sincerely, P.S. - In case you have questions about suggestions I have made, feel free to contact me by letter or by phone (I'm usually at home: (812) 333-4334). I am keeping a photocopy of the whole thing.