Reprinted from FORMAL REPRESENTATION of HlJ!~~24N JUDGMEPJT, Benjamin Kleinmuntz (ed.), 1968. Copyright @ 1968 by John Wiley & Sons, Inc. CHAPTER 7 Joshua Lederberg and Edward A. Feigenbaum Stanford University MECHANIZATION OF INDUCTIVE INFERENCE IN ORGANIC CHEMISTRY + INTRODUCTION A paradigm of the scientific method is the successive alternation of reference to hypothesis and datum. We start with a hypothesis, which moves us to select some aspect of the real world for empirical enquiry. Sensory impressions are translated by convention into data. More refined hypoth- eses are then induced by a poorly understood process which contains at least two elements: (1) the data somehow suggest a hypothesis, and (2) deductive algorithms are applied to the hypothesis to make logically necessary predictions; these are then matched with the data in a search for contradictions. A hypothesis is regarded as inductively proven (i.e.. we have achieved a scientific discovery) when its predictions are satisfied, and when we have the illusion of inductive exhaus- tion that no other hypothesis will lead to equally concordant predictions. It is rare for inductive exhaustionto be rigorously justified. Usually, the process is reiterated many times: each refinement of hypothesis suggests the examination of new data; each new datum leads to a discrimination among existing hypotheses. or suggests another refinement. Our main contribution is, perhaps, the suggestion that organic chemistry is an apt field for the mechanization of the process of scientific induction. It can be circumscribed so that the first studies are simplified without undue loss of *The research reported here was supported in part by the Advanced Research Projects Agencyofthe Office ofthe SecretaryofDefense (SD-183 1, and in part by the National Aeronautics and Space Administration (NsG 81-601. 187 188 Formal Representation of Human Judgment utility or generality. Real data of any desired level of com- plexity can be adduced. Few natural sciences are so rich in inductive analysis from information that can be presented in a simple, uniform format. By contrast, genetics or embryology are sciences that might invoke models of most of external reality. Above all, the hypotheses of organic chemistry can be abstractly represented, that is, as structure diagrams. These lead, in turn, to an algebra for inductive exhaustion rarely available in any other scientific field at the present time. As will be seen, our program (named DENDRAL) is firmly rooted in this algebra which can generate an exhaustive and irredundant list of hypotheses from initial contextualdata (1,2,3,4). The problem of inductionis then reduced to efficient selection from this prospective list. This is only feasible if the experimental data are recurrently consulted to guide the hypothesis-generator. Unproductive branches of the genera- tion tree are anticipated and avoided as soon as possible; conversely, the data are used for heuristic reordering of the priorities with which the hypotheses are brought up for ex- amination. The program thus simulates a systematic idealiza- tion more than it does the haphazard evocation of new concepts in human intelligence. HEURISTIC DENDRAL: A SUMMARY FROM THE STANDPOINT OF ARTIFICIAL INTELLIGENCE RESEARCH The intent of this section is to present in a succinct and compact fashion information relevant to a general understand- ing of what our program does and how it does it. Motivation and Task Environment We have been interested in exploring processes of em- pirical inquiry, particularly discovery processes involved in searching a hypothesis space for hypotheses meaningful and relevant to the explanations of real-world data. Some practical considerations concerning the automation of routine scientific endeavor in particular environments, using heuristic pro- gramming techniques, also supplied some of the motivation. We were interested also in exploring man-machine interaction in the context of scientific problem solving, not only as an Mechanization of Inductive Inference 189 augmentation to human problem solving processes ("smart scratch paper") but also as a means for "educating" a suitably receptive program, by making it easy for a human skilled in the task area to impart to the program his heuristic search rules and other information relevant to good performance in the task. The task area chosen was the analysis of mass spectra of organic molecules. The hypotheses relevant to explaining mass spectral data in organic chemical analysis are essen- tially graphs--molecular graphs consisting of atoms and bonds, such as are seen in textbooks on organic chemistry, The main tasks presented to the program at present are: 4 1. Given a chemical (compositional) formula, output a list of the chemically most plausible isomers (structuralvariants) - of the composition, ordered from most plausible through least plausible if that is requested. 2. Given a mass spectrum and a composition, output a list of the most plausible isomers of the composition in the light of the spectral data given: Restated, generate a hypoth- esis or list of hypotheses ,to best explain some given spectral data. Proceeeeo, Algorithmic and Heurietic The program that solves these problems is called Heuristic Dendral. It `is a LISP program of some 30-40 thousand words, developed on the SDC Q-32 time-sharing system and presently operating on the PDP-6 at the Stanford Artificial Intelligence Project. Though the program consists of many functions, the most important activities can be summarized as follows: 1. At the most basic level, there is an algorithm, called the Dendral Algorithm, rarely exercised without constraints. Given a chemical composition,, it will generate all of the topologically possible noncyclical connected graphs that canbe made from the atoms of the composition, given the valences of these atoms. Associated with the Dendral Algorithm is a notation for these graph structures, called Dendral notation. Canonical forms of the graph structures in Dendral notation exist and are used. The Dendral Algorithm is a systematic 190 Formal Representation of Human Judgment and exhaustive "topologist" and knows nothing about chemistry, except the valences of atoms. But, using a chess analogy, it is the "legal move generator," the ultimate guarantor of the completeness of the hypothesis space. 2. Heuristic processes control and limit the generation process (i.e., prune the implicit generation tree). Taken together, these heuristics constitute the program's "chemical model." This model includes: a list of denied embedded subgraphs, the existence of any one of which in a strucutre rules out that structure as a plausible hypothesis; a list of well-known, stable, and generally highly significant radicals which are treated in an aggregate fashion as "superatoms," or essentially higher level concepts; an evaluation function' not dependent on spectral data that evalutes the potential fruitfulness of attempting to generate structures from a col- lection of as yet unassigned atoms (i.e., evaluates the worth of pursuing a particular subproblem); a data matching process that we sometimes call "the zero-order theory of the mass spectrometer" that makes decisions about the relevance of the subproblems based on the actual mass numbers present in the given spectrum; a "rote memory,"called the Dictionary, which is the memory of previously solved subproblems, correspond- ing to the theorem memory in theorem-proving programs; and a few other heuristic processes of lesser importance. 3. Learning processes in Heuristic Dendral are relatively simple, and a high order of learning by the program itself remains more of a goal than an accomplished fact. The main "internal" learning process is the Dictionary building activity. Learning on the subproblem evaluation function a la Samuel's Checker Program is possible, but not implemented. "Ex- trinsic" learning, in the sense of a human expert communicat- ing to the program the elements of the chemical model, has been extensively and successfully used. Perhaps this is high level programming, but in the same sense that pedagogy in general is high level programming activity. ORGANIC CHEMISTRY THE PROBLEM-CONTEXT OF DENDRAL The fundamental problem of organic chemistry is the topological structure of a molecule. This was first brought into Mechanization of Inductive Inference 191 focus by the Swedish chemist, Jons Jakob Berzelius (1779- 1848) when he established the occurence of chemical isomers. These are different organic molecules having the same chemical composition or ensemble of atoms; hence they have different structures (i.e., connectivities of the atoms with re- spect to atom-to-atom bonds). For one of the simplest ex- amples, take C,H,O, which has the two isomers, dimethyl ether and ethanol (Fig. 7.1). Todetermine that the composition of a compound once obtained as a pure sample, say C2H60, is essentially a mechanical process of quantitative analysis. To assign it to one of the possible isomers is a much more demanding intellectual exercise. H H H-Lo-&-H !l k HH H-C-&-H IilL 2 H3 ,CH, \CHs CHa `OH (4 (b) Fig. 7.1 Two isomers of CzHsO: (a) Dimethyl ether (O..CH3 CH3), (b) Ethanol or ethyl alcohol (CHZ..CH3 OH). Each of these may also be represented by isomorphic graphs. for ex- ample, for ethanol: (CH3.CHI.OH) or (OH.CH2.CH3) or (CH2..OH CH3). The previous notation is in canonical DENDRAL form, being initialized by the center of the graph, followed byadot for each radical. and then a list of radicals in order of an algorithmically defined value. For internal repre- sentation or more compact coding, the H's can be dropped, leaving us with O..CC and C..CO respectively. Where symmetries prevail, we can go one step further, using the `/I as a ditto mark, as in (O./C) for (O..CC). This economy is, of course, trivial here, but not so in more complex formulas. At this level of analysis, structure means connectivity, not geometry. In fact, with the help of X-ray diffraction analysis, a great deal can be learned about the actual disposi- tion in space of the atoms in a molecule in the crystalline state. However, the molecules, especially in the liquid or gaseous states, may be undergoing a variety of dynamic transitions- linear, rotational, and rocking modes about every chemical bond. Chemical geometry is beyond the scope of the present 192 Formal Representation of Human Judgment discussion, but what we know of it could be superimposed upon the topological frameowrk developed below. The preceding paragraph canbe summarized: a chemical structure is represented by an undirected graph whose nodes are atoms, whose edges are chemical bonds. While the analogy was recognized 100 years ago, this outlook has still to penetrate the teaching of organic chemistry. In practical problem solving the chemist uses every possible datum. For example, smell can help him decide be- tween dimethyl ether and ethanol, if he didnot already recog- nize that the ether would be much more volatile than its isomeric alcohol. He also has a repertoire of reagents that can help to detect various fragments (called radicals) in the molecule (e.g., -OH). More recently a specialized instrument, the mass spectrometer, has been developed which facilitates a unified systematic attack on structural problems, Briefly, a molecule is bombarded by an electron beam which sputters off an electron, leaving a positively charged molecule-ion. A fraction of these fragment, giving radical ions of various sizes corresponding to different modes of cleavage, often com- plicated by further rearrangements and reactions of fragments. Finally, the ensemble of molecule- and radical-ions is re- solved by careful acceleration through electrostatic and mag- netic fields. The utility of the mass spectrometer and some examples of the logical inference employed in exploiting it are reviewed by McLafferty (1966). The mass spectrum is a paired list of mass numbers and their relative intensities. Mass spectrometers of very high resolution have been built, capable of distinguishing between radicals of different composition but the same integer atomic weight. For example, the radical -NH, M =15.0110 can be distinguished from the radical -CH,, M =15.0215. This cap- ability is especially useful for determining the formula of the . intact molecule. Unless we specify otherwise, however, we have in mind the more ordinary low resolution mass spectrom- eter which lumps together species having the same integral mass. However, more precise data are readily accommodated and avidly used by the program logic. The stated goal of our program is thenan inductive solu- tion of the mass spectrum. That is, a molecular formula and its mass spectrum are given as data, We must induce the Mechanization of Inductive Inference 193 structure (hypothesis) that best satisfies the data. Our basic approach to this has been first to furnish the computer with a language in which chemical structure hypotheses can be expressed, then to interrogate chemists and their literature for the rules and techniques they have used in problem solving and attempt to translate these into computer algorithms. In the course of searching for these heuristics, we have in fact discovered a number of algorithms which are much more systematic than the approaches commonly used by chemists in this field. ISOMERS Underlying the solution of virtually every problem and subproblem in structural organic chemistry is the potential exhaustion of the list of possible isomers of a given molecule or radical. It is remarkable that while hundreds of thousands of students of elementary organic chemistry are challenged in this way every year, no algorithmfor generating and verifying complete lists of isomers has hitherto been presented. Each student is left to work out his own intuitive approach to this problem, which may account for the bafflement with which very many students approach the subject upon their first ex- posure to it. The core of DENDRAL is a notation for chemical struc- tures and an algorithm capable of producing all distinct iso- mers and casting each of theminto a canonical representation. This will be outlined in more detail further on. The lowest level of DENDRAL might be called the topo- logist. This machine considers only the valence rules and elementary graph theory in constructing lists of isomers. It uses two elementary concepts, one, the center of a graph as a point of departure, andtwo, arecursiveprocedurefor evaluat- ing a radical as a way of specifyingthe canonical representa- tion of a given molecule. After the center of the map is fixed, being either a bond or an atom of known valence, the radicals pendant on the center must be listed in nondecreasing value. The apical node of each radical is then regarded as a new center and the process continues recursively. The same approach can be used to make a generator from DENDRAL. From the formula or composition list, a bond or a 194 Formal Representation of Human Judgment given species of atom is first taken as the central feature and the remaining atoms partitioned in appropriate ways, and these partitions assigned tentatively to the pendant radicals. For each radical then successive allocations are made for the apical node and then partitions are allocated to the pendant subradicals, and so forth. TABLE 7.1 Canons of Dendral Order* (Hierarchy of Vector Valuation in Decreaeing Order of Significance) The DENDRAGVALUE of a Radical Consists of the Vector: COUNT Rings by number of rings t Other atoms (except II) COMPOSITION of radical Rings t by valuation of ring Compoeition, Vertex Group, Path List, Vertex List, Substituent Locations Other atome by atomic number (S, P, 0, N,C) UNSATURATIONS (afferent link included; ring paths excluded) APICAL NODE Ring Value t Degree : number of efferent radicals Composition: e.g. (S, P, 0, N, C) Afferent Hnk: (i, :, .) APPENDANT RADICALS if any (nested vectors in canonical order): nil if current apex is terminal Enantiomerism around apex (DL, D, L, unspecified) if applicable *From J. Lederberg (1964), DENDRAL64, NASA CR-57029, Star No, N65-13158 t Rings are not discussed in this paper. Each line above is a separate cell or subcell of the vector. Table 7.1 summarizes the order of allocation for evaluat- ing a radical, and for generating the next structure, a pro- cedure used recursively in the DENDRAL program in LISP. At this time, the operational program is confined to acyclic structures. However, the specifications have been detailed for a complete system, including ring structures of arbitrary com- plexity as well as consideration of optical isomerism (1,2). Table 7.2 lists the computation of all the isomers generated by the topologist for the formula CSH,N02, one of whose isomers Mechanization of Inductive Inference 195 is the common amino acid, alanine. This exercise is already at the very margin of human capability, barring the possible rediscovery of this algorithm. In practice no intelligent human has the patience to attempt to generate such a list by the in- tuitive process. The chemist willoftenthendemand redundant information at this point in order to narrow the range of pos- sibilities he is obliged to consider before he will make the effort to produce an exhaustive list. The topologist knows only the valence rules as quasi- empirical data, that is, that four bonds must issue from each carbon atom, three from any nitrogen, two from any oxygen, and but one from hydrogen. With this very limited quota of chemical insight, the topologist produces many structures that would be regarded as absurdities by the experiencedchemist. for example the radical (.O.NH*OH) in no. 4of Table 7.2. The next stage in the development of DENDRAL is then to impart a certain amount of additional chemical information taken from the real world. Indoingthisadefinite context is implied, even if this is not immediately overt. There are probably many realms of organic chemistry, for example, at ultra low tem- peratures, of which we have only limited experience. The implicit context we have in fact adopted is that of the natural product, that is to say, molecular species that might be reason- ably stable at ambient temperatures, and therefore stand some chance of persisting or being isolated from natural sources. However, this rule has been applied rather cautiously and the lists that will be adduced for further illustration still contain a number of items which would be regarded as quite dubious by this criterion. The program is quite amenable to adjustment to any given set of facts. Indeed, a certain stage in the program can be switched on to interrogate the chemist to help to find the con- text in which various rules will be applied or not. At this stage chemical insight is given most explicitly by providing a list of forbidden substructures. Whenever these substructures are encountered during the building of a potential molecule, the generator is adjusted to ignore that entire branchof synthetic possibilities. In order to effectuate this use of a "BADLIST," a graph matching algorithm has been incorporated into the DENDRAL program. At best, however, graph matching is an expensive proposition and it soon became necessary to seek ways of economizing on redundant computation. The last ((U 1. 2. 3. 4. 5. !: 8. 1:: 2: 13. 14. 15. :;: 18. i:: 21. E: 24. 25. 26. 9;: 29. 30. 31. TABLE 7.2 The Isomers of Ala- nine, C3H7N02, without Chemical Common Sense* >(ISOMERS *AlANINE) >BADLIST NIL >GOODLIST NIL >SPECTRUM NIL > DICTLIST NIL >(ILLEGAL ATTACHMENTS) (NIL NIL NIL) C3H7N02 MOLECULES 1.1 (C . 3.1 (N . 1.1 (0 . 2.)) . C3H7 0.N = 0, . CH..CH3 CH3 0.N = 0, . CH2.CH = CH2 NH .O.OH, . CH2.CH = CH2 O.NH.OH, . CHZ.CH = CH2 O.O.NH2, . CH2.CH = CH2 N..OH OH, . CH = CH.CH3 NH .O .OH. CH = CH.CHi O.NH.OH; CH = CH.CH3 O.O.NH2, CH = CH.CH3 N..OH OH, C.=CH3 CH2 NH.O.OH, C.=CH3 CH2 O.NH.OH, C.=CH3 CH2 O.O.NH2, C.=CH3 CH2 N..OH OH, CH2.CH2.NH2 O.CH = 0, CH2 .CH2.NH2 *COOH, CHZ.NH.CH3 O.CH = 0, CH2.NH.CH3 *COOH a NH .C2H5 O.CH = 0,' NH.C2H5 `COOH, CH..CH3 NH2 O.CH = 0, CH. .CH3 NH2 %OOH, N..CH3 CH3 O.CH = 0, N..CH3 CH3 *COOH, CHZ.CH = NH CHZ.O.OH, CHZ.CH = NH O.CH2.0H, CH2.CH = NH O.O.CH3, . CH2.CH = NH CH..OH OH, . CH = CH.NH2 CH2.0.0H. . CH = CH.NH2 O.CHZ.OH;. . CH = CH.NH2 O.O.CH3, 196 TABLE 7.2 (Continued) 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. CH = CH.NH2 CH2.N = CH2 CH2.N = CH2 CH5.N = CH2 CH2.N = CH2 CH = N.CH3 CH = N.CH3 CH = N.CH3 CH = N.CH3 NH.CH = CH2 NH.CH = CH2 NH.CH = CH2 NH.CH = CH2 N = CH .CH3 N = CH.CHj N = CH .CH3 N = CH.CH3 C.=CH3 NH C.=CH3 NH 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. ::: 83. 84. CH..OH OH, CH2.0.0H. O.CH2.0H; O.O.CH3, CH..OH OH, CHZ.O.OH, O.CH2.0H, O.O.CH3, CH..OH OH, CHZ.O.OH, O.CH2.0H, O.O.CH3, CH..OH OH, CH2.0.0H, O.CH2.0H, O.O.CH3, CH..OH OH, CHZ.O.OH, O.CH2.0H. C.=CH3 NH O.O.CH3, ' C.=CH3 NH CH..OH OH, C = .CH2 NH2 CHZ.O.OH, C = .CH2 NH2 O.CH2.0H; C = .CH2 NH2 O.O.CH3. C = .CH2 NH2 CH..OH 6H, CH2 .CH2 .OH CH2.N = 0, CH2.CH2.OH CH = N.OH, CH2.CH2.OH NH .CH = 0, CH2 .CH2 .OH N = CH.OH, CH2.CH2 .OH O.CH = NH, CH2.CH2.OH 0.N = CH2, CH2.CH2.OH *CONHZ, CH2.CH2 .OH C = .NH OH, CH2 .O .CH3 CH2.N = 0, CH2.D.CH3 CH = N.OH, CH2 .O .CH3 NH.CH = 0, CH2 .O .CH3 N=CH.OH, CH2 .O .CH3 O.CH = NH, CH2 .O .CH3 0.N = CH2, CH2 .O .CH3 KONH3, CH2 .O .CH3 C = .NH OH, 0 .C2H5 CH2.N = 0, D .C2H5 CH = N.OH, O.C2H5 NH .CH = 0, 0 .C2H5 N = CH.OH, 0 .C2H5 O.CH = NH, 0 .C2H5 0.N = CH2, 0 .C2H5 KONH2, 0 .C2H5 C = .NH OH, CH. .CH3 OH CH2.N = 0, CH..CH3 OH CH = N.OH, CH. .CH3 OH NH.CH = 0, CH..CH3 OH N = CH.OH, 197 TABLE 7.2 (Continued) 85. 86. 87. 8 90: 91. 92. 93. 94. 95. 96. 2 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. CH..CH3 OH O.CH = NH, CH. .CH3 OH CH..CH3 OH CH2.CH=O CH..CH3 OH CH2.CH =0 CH2 .CH = 0 CH2.CH = 0 CHZ.CH=O CH2.CH =0 CH2.CH = 0 CH2 .CH = 0 CH = CH.OH CH = CH.OH CH = CH.OH CH = CH.OH CH = CH.OH CH = CH.OH CH = CH.OH CH = CH .OH O.CH=CH2 O.CH=CH2 O.CH=CH2 O.CH = CH2 O.CH=CH2 O.CH=CH2 O.CH=CH2 O.CH=CH2 C.=CH3 0 C. = CH3 0 C. = CH3 0 C. =CH3 0 C.=CH3 0 C. = CH3 0 C.=CH3 0 C.=CH3 0 C = .CH2 OH C = .CH2 OH C = .CH2 OH C = .CH2 OH C = .CH2 OH O.CH2.NH2, . C = .CH2 OH O.NH.CH3, . C= .CH2 OH CH. .NH2 OH, C = .CH2 OH - CH.C2H5 N..CH3 OH, N.O.OH, = C..CH3 CH3 N.O.OH, = CH .CH2 .NH2 CH.O.OH, = CH .CH2 .NH2 C..OH Oi, = CH.NH.CH3 CH.O.OH, = CH.NH.CH3 C..OH OH, = N.C2H5 CH.O.OH, = N.C2H5 C..OH OH, = C..CH3 NH2 CH.O.OH, 198 C= .NH'OH, CH2 .NH .OH, 0.N = CH2; CH2.O.NH2, "CONH2. NH.CH2.0H, NH.O.CH3. O.CHZ.Nii, O.NH.CH3, CH. .NH2 OH, N..CH3 OH. CH2 .NH .0:-i; CH2.O.NH2, NH .CHZ.OH, NH.O.CH3, O.CH2.NH2, 0 .NH.CH3, CH. .NH2 bH, N..CHS OH. CH2.NH.OH; CH2.O.NH2, NH.CH2.0H, NH.O.CH3, 0 .CH2 .NH2, O.NH.CH3, CH. .NH2 OH, N..CH3 OH, CH2.NH.OH, CH2.O.NH2, NH.CHZ.OH, NH.O.CH3, O.CH2.NH2, O.NH.CH3. CH. .NH2~ 6H N..CH3 OH, CH2.NH.OH CH2.O.NH2 NH.CH2.0H NH.O.CH3, TABLE 7.2 (Continued) 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. = C..CH3 NH2 C..OH OH, = CH.CH2.0H CH.NH.OH. = CH.CH2.0H CHIO.NH2; = CH .CH2 .OH N.CH2.0H, = CH .CH2 .OH N.0 .CH3, = CH .CH2 .OH C..NHZ OH, = CH.O.CH3 CH.NH.OH, = CH.O.CH3 CH.O.NH2, = CH.O.CH3 N.CH2.0H, = CH.O.CH3 N.O.CH3, = CH.O.CH3 C..NHZ OH, = C..CH3 OH CH .NH.OH, = C. .CH3 OH CH.O.NH2, = C..CH3 OH N.CH2.0H, = C..CH3 OH N.O.CH3, = C..CH3 OH C..NHZ OH, CH... CH3 CH = NH O.OH, c.=. CH3 CH .NH2 O-OH, CH... CH3 N = CH2 O.OH, c.=. CH3 N.CH3 O.OH, CH... CH3 CHZ.OH N=O, C..= CH3 CH2 .OH N.OH, CH... CH3 CH =O NHSiH. CH... CH3 CH = 0 O.NH2, C.= . CH3 CH .OH NH.OH, C.= . CH3 CH .OH O.NH2. c= . . CH2 CH2.NH2 o.oil, c= . . CH2 NH .CH3 O.OH, c= . . CH2 CH2 .OH NH .OH, c= . . CH2 CH2 .OH O.NH2, c= . . CH2 O.CH3 NH.OH, c= . . CH2 O.CH3 O.NH2, CH... NH2 CH = CH2 O.OH, c.=. NH2 CH .CH3 O.OH, CH... NH2 CH2 .OH CH=O, NH2 CH2 .OH CH .OH, NH2 O.CH3 CH=O. c.. = E".= c Y c= :: c= . . c= . . NH2 0 .CH3 CH .OH', NH C2H5 O.OH, NH CH2 .OH CHZ.OH, NH CH2 .OH O.CH3, :I? 0 .CH3 0 .CH3, C2H5 N=O, OH C2H5 N.OH, CH.. . c.. = CH... OH CH = CH2 NH.~H, CH... OH CH = CH2 O.NH2. C.=. OH CH .CH3 NH.OH,' C.=. OH CH.CH3 O.NH2, CH... OH CH2.NH2 CH = 0, C..= OH CH2.NH2 CH .OH, CH... OH NH .CH3 CH=O, 199 O.CH3 N= 0, 0 .CH3 N.OH. 200 Formal Representation of Human Judgment TABLE 7.2 (Continued) 191. 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211. 212. 213. 214. 215. 216. .= C. CH CH C. C. . . . OH OH OH OH OH . . . = . = . z::: c.=, c.=. c= . . c= . . c= . . NH .CH3 CH.OH, CH = NH CH2.0H, CH = NH O.CH3, CH .NH2 CHZ.OH, CH.NH2 O.CH3, N = CH2 CHZ.OH, N = CH2 O.CH3, N .CH3 CH2.0H, N.CH3 O.CH3, C2H5 NH.OH, C2H5 O.NH2, CH2.NH2 CHZ.OH, CH2.NH2 O.CH3, NH .CH3 CH2.0H, NH .CH3 O.CH3, CH = CH2 O.OH, CH2.0H CH=O, O.CH3 CH = 0, C2H5 CH = 0, CH = CH2 CH2.0H, CH = CH2 O.CH3, 0 0 0 CH3 c- . . c= . . c= . . N . . . N . . . N . . . N . . . N . . . N... OH OH 0 0 0 CH3 CH3 OH OH OH C . . . . CH3 CH3 OH N=O, C . . . . CH3 NH2 OH CH=O. C.... CH3 OH OH CH = Nil; C.... CH3 OH OH N = CH2, C *... NH2 OH OH CH=CH2. *This is a complete list of the topological possi- bilities. The restraints of BADLIST and of a filtered DICTIONARY have been relaxed. Com- pare with Table 7.4: the additional structures here are chemically implausible for the standard context of the intended use of DENDRAL. For example, no structures are empirically known which contain the radical (.O.NH.OH). In these and following tables, the text is all computer output except lines prefixed with >. which are input from the teletype. important feature merely exploits an idiosyncrasy of the DEN- DRAL program that makes it easy to detect linear sequences of nodes that might be on a list of illegal attachments, for example, -N-N-N or -0-O. Of far greater generality is the use of a dictionary of solved subproblems. As soonas the programhas gone a short way towards a solution of any practical problem, DENDRAL would find itself constantly redoing the same subproblems over and over again as it rebuilds radicals on one side of the TABLE 7.3 Calling the Function (PRINDICTER) Results in a Dump of the Dictionary in Its Current State* (PRINDICTER) UOOCOlNOlOOl 1. .CH2.NH.OH 2. .CH2.O.NH2 3. .NH.O.CH3 4. .O.NH .CH3 5. .N..CH3 OH u01c02001 1. .CH2.CH = 0 2. = CH.CH2.0H 3. = CH.O.CH3 4. .O.CH = CH2 5. .C. = CH3 0 6. =C..CH3 OH UOlCOlNOlOOl 1. .CH = N.OH 2. = CH.NH.OH 3. = CH.O.NH2 4. .NH.CH = 0 5. = N.O.CH3 6. .O.CH = NH 7. .O.N = CH2 8. .*CONH2 uooco2oo1 CH2.CH2.OH 2. .CH2.;:CH3- 3. .O.C2H5 4. .CH..CH3 OH uooco1oo2 (NO STRUCTURES) u00c01001 1. .CH2 .OH 2.. .O.CH3 UOlC02NOl 1. .CH2.CH = NH 2. .CHZ.N = CH2 3. .CH = N.CH3 4. = CH.CH2.NH2 5. = CH.NH.CH3 6. .N = CH.CH3 7. = N.C2H5 8. .C. = CH3 NH 9. = C..CH3 NH2 UOlCOlNOl 1. .CH = NH 2 = CH.NH2 .N = CH2 4. = N.CH3 u01c01002 .O .CH = 0 2. .*COOH1 ' 201 TABLE 7.3 (Continued) u01c01001 .CH = 0 2. = CH.ki UOOC02NOl CH2.CH2.NH2 2. .CH2.1Nii CHj 3. .NH .C2tt; 4. .CH. .CH3 NH2 5. .N. .CH3 CH3 UOOCOlNOl .CH2 .NH2 2. .NH.CH;- UOON01002 (NO STRUCTURES) UOONOlOOl .NH.OH 2. .O.NH;' UOONOl 1. .NH2 UOlCO3 1. .CH2.CH = CH2 5: .CH = CH.CH3 =CH.C2H5 4. .c. = CH3 CH 5. = C..CH3 CH3 UOlCO2 1. .CH = CH2 2. = CH.CH3 u01c01 1. =CH2 UOlN01002 1. .O.N = 0 UOOOD2 (NO STRUCTURES) UOlNOlOOl 1. .N = 0 2. = N.OH UOlNOl 1. = NH u00001 1. .OH uo1002 (NO STRUCTURES) u01001 1. =o iJooco3 2. .CH?.CH3%~; uooco2 1. .C2H5 202 Mechanization of Inductive Inference 203 TABLE 7.3 (Continued) u00c01 1. .CH3 DONE o This example shows the dictionary that was built ??? Table 7.4, and contains the radicals needed to gen- erate the molecules isomeric to C3H7NO2. The headings encode the compositions in the form UaaCbb NccOdd where C, N, 0 have their usual connotation of atoms, and U stands for "unsaturations." This is calculated as double-bond-equiva- lents, or the number of pairs of H by which the composition falls short of a saturated, that is, double-bond- free molecule. molecules after reconstructing the other side, In order to avoid the waste involved in this redundancy, the program automatically generates a list of compositions which is con- sulted whenever a new radical is to be generated. If the com- position of the new radical appears in the dictionary, the dictionary contents are simply copied out. If not, the problem is solved and a new dictionary item is entered for further use later, Insofar as the dictionary has already been filtered with respect to BADLIST, a great deal of effort can be saved, and in fact the program would not be practical for molecules of even moderate complexity were it not for this feature. As an example, the dictionary that has been generated in the solution of the alanine problem is given in Table 7.3, and the filtered list of isomers is Table 7.4. It is also feasible and desirable to give chemical insight into the program by overt manipulation of the dictionary. That is to say, when a given context calls for it, the radicals corresponding to a given composition can be entered directly, usually with the aim of excluding certain idiosyncratic items. This must be done with great care, since the list of larger radicals that may be generated later relies upon the dictionary already established for smaller radicals. A serious problem encountered in practice is managing the trade-off between the growth of the dictionary and the cor- responding loss of scratch space for the LISP program to TABLE 7.4 The Isomers of Alanine, CsH?NOt, Restrrlned by Common Sense* >(ISOMERS MIANINE) > BADLIST NC (1. (N 0)) (1. (N ON) UN 0) (1. C (1. (N 0)))) (C (3. C (1. (N 0) (1. ii)))) (C (3, C) (1. tN 0) (1. MM ON 0) (1. HI (1. C (3. 0)) (C (2. C (1. (N 0) (1. H)))) (C (2. C) (1. (N 0) (1. HI)) UN 0) (1. tl) (1. C (2. C))) (N (2. C (1. 0 (1. Ii)))) (C (2. N) (1. 0 (1. H)I) (0 (1. (1. C (2. NM (0 (1. 0)) (0 (1. N (1. ON (N 11. 0) (1. 011 K (1. HI (1. N (2. ON (N (1.' C (1. HI) (2. 0)) (*o* (1. 0 (1. HII) UN 0) (1. C (1. 0 (1. HII (2. 0)))) > GOODLIST NIL > SPECTRUM NIL > DICT LIST NIL >(ILLEGAL ATTACHMENTS) (NIL ((N N N)) ((0) WHZOHW CW7NO2 MOLECULES NJ . 1.1 (C . 3.1 (N . 1.) (0 . 2.)) 5: 3. 4. ii: 2 9. ii: K 14. ii: 17. ii: 20. Ii: Z: 25. . . C3H7 0.N = 0, CH..CH3 CH3 0.N = 0, CH2.CH2 .NH2 0 .CH =0, CH2.CH2 .NH2 *COOH, CHZ.NH.CH3 *COOH, NH .C2H5 O.CH = 0, CH. .CH3 NH2 SOOH, N..CH3 CH3 O.CH = 0, CHZ.CH2.OH CH = N.OH, CH2 .CH2 .OH NH .CH = 0, CHZ.CH2 .OH O.CH = NH, CH2 .CH2 .OH 0-N = CH2, CH2 .CH2 .OH `CONHZ, CH2 .O.CH3 CH = N.OH, CHZ.O.Ct-43 *CONHZ, O.C2H5 CH = N.OH, O.CZH5 NH.CH = 0, CH..CH3 OH CH = N.OH, ;t+;.g3 00" `CONH2, CH2:CH = 0 CHZ.NH.OH, CH2.O.NH2, CHZ.CH = 0 NH.O.CH3, CH2 .CH = 0 O.NH.CH3, cyt& i N..CH3 OH, C CH2 .NH.OH, 204 tl) Mechanization of Inductive Inference TABLE 7.4 (Continued) 26. 27. 28. 29. 30. 2 33. 34. E: 37. 2 40. 41. fS: 44. 45. 46. 47. 48. 49. 50. . C.=CH3 0 CH2.O.NH2, . C. =CH3 0 NH.O.CH3, . C.=CH3 0 O.NH.CH3, . C.=CH3 0 N..CH3 OH, CH.CH2.0H l CH.CH2.0H CH.O.NH2, N.O.CH3, = CH.O.CH3 CH.O.NH2, = CH.O.CH3 N.O.CH3, C..= CH3 CH2 .OH N.OH, C = CH3 Cl?... CH3 0 .CH3 N.OH, CH = 0 NH (SETQ GOODLIST SAVEGOODLIST) WCOOH* (1. C (1. 0) (2. 0)) 100 . 01 (*CO* (2. C (2. 0)) iO0. 0.) -(*CHNH2* (2. C (1. N)) 100. 0.) FCH20H" (1. C (1. 0)) 100. 0.1 (WOH* (2. N (1. 0)) 100. 0.) ;;;N"o*))(l. C (2. N)) 100. 0.1 FNCH2* (1. N (2 C)) >clSOn;rEdS *ALANINE) MOLECULES ((U . 0.) (C . 1.1 (*COOH* . 1.) (%HNH2* . 1 .)I 1. . CH2.CH2.NH2 %OOH, 2. . CH. .CH3 NH2 xCOOH, MOLECULES KU . 0.) (C . 2.) (N . 1.) (XCOOH" . 1.)) 1. . CH2.NH.CHS *COOH, MOLECULES uu . 0.) wo* . 1.) (*CHNH2* . 1.) (*CH20H* . 1.)) 1. c=.. 0 CH2.NH2 CH2.0H, 2. CH... NH2 CH2 .OH CH = 0, MOLECULES KU . 0.1 (C . 1.) (0 . 1.) (*co* . 1.1 1. c=.. 0 CH2.NH2 0 .CH3, CH2.CH2.NH2 3: CH... OH O.CH=O, CH2.NH2 CH = 0, MOLECULES ((IJ . 0.) CC . 1.) (N . 1.) (*CO* . 1.) 1. c=.. 0 NH.CH3 CH2.0H, 2. . CH2 .CH2 .OH NH .CH = 0, 3. . CH2.CH2.OH *CONH2, MOLECULES ((U I 0.) (C . 2.) (*CO* . 1.) (*NOH* . c= 0 :: N..:' OH C2H5 NH .OH, C2H CH = 0, 3. . CH2.CH = 0 CH2.NH.OH, 4. . CH2 .CH = 0 N..CH3 OH, 5. . C. = CH3 0 CH2.NH.OH, 6. C.=CH3 0 7. CH... N..CH3 OH, CH3 CH = 0 NH.OH, MOLECULES (KHNH2" ("CH20H* 1 .)) ((IJ . 0.) (`2 . 2.1 (N . 1.) (0 . 1.) ('CO* , 1.1) :: : CH2.O.CH3 "CONH2, CH2.CH = 0 CH2.O.NH2, 3. . C. = CH3 0 CH2 .O .NH2, 4. . NH .C2H5 O.CH = 0, 5. . CH2 .CH = 0 NH.O.CH3, 6. . C. = CH3 0 NH.O.CH3, 7. . O.C2H5 NH.CH = 0, 207 1.)) 1 .)I TABLE 7.5 (Continued) 8. . CHZ.CH = 0 O.NH.CH3, 9. C.=CH3 0 c= 0 0 .NH.CH3, 10. C2H5 iii. .CH3 OH O.NH2, 11. 12. CH... CH3 *CONH2, CH = 0 O.NH2, 13. N..CH3 CH3 O.CH = 0, 14. Ii... CH3 O.CH3 CH =0, MOLECULES uu . 1.1 (%HNH2* . 1.) KH20H* . 2.1) XNO ALLOWABLE STRUCTURES MOLECULES ((U . 1.) cc . 1.1 (0 . 1 .I (eHNH2* . 1 .I VCH20H* . 1.)) MOLECULES ((U . 1.1 (C . 2.) (0 . 2.) (*CHNH2* . 1.1) *NO ALLOWABLE STRUCTURES MOLECULES (NJ . 0.) FCH20H* . 2.1 FCHNH* . 1.1) *NO ALLOWABLE STRUCTURES MOLECULES ((U . 0.) PCH20H* . 2 .I PNCH2*. 1.)) `?dO ALLOWABLE STRUCTURES MOLECULES KU . 1.1 (C . 1.1 (N . 1.1 (KH20H* . 2.)) 1. C=.. NH CH2.0H CH2 .OH, MOLECULES ((U . 1.1 (C . 2.1 (XCH20H" . 1.1 (*NOH* . 1.1) 1. CH2.CH2.OH CH = N.OH, 2. i..= CH3 CH2 .OH N.OH, MOLECULES au . 0.) (C . 1.1 (0 . 1.1 (`CH20H* . 1.1 (`*CHNH* . 1.)) 1. CH2.CH2.OH 2. CH... OH O.CH = NH, CH = NH CHZ.OH, MOLECULES ((U . 0.) (c . 1.) (0 . 1.1 (*CH20H* . 1.) FNCH2* . 1.1) 1. . CH2.CH2.OH 0.N = CH2, MOLECULES ((U , 1.) (C . 2.) (N . 1.) (0 . 1.1 (`CH20H* . 1.)) :: = CH .CH2 .OH CH .O.NH2, = CH.CHZ.OH N.0 .CH3, 3. C= . . CH2 CH2.0H 0 .NH2, 4. C = .`. NH CH2 .OH O.CH3, MOLECULES ((U . 1.) (`i-l . 3.) (0 . 1.) (*NOH* . 1.)) 208 Mechanization of Inductive Inference 209 TABLE 7.5 (Continued) 1. . CH2.O.CH3 CH = N.OH, 2. , O.C2H5 CH = N.OH, 3. CH..CH3 OH CH = N.OH, 4. i..= CH3 O.CH3 N.OH, MOLECULES ((U . 0.) (C . 2.1 (%HNH* . 1.)) kNO ALLOWABLE STRUCTURES MOLECULES ((U . 0.) cc . 2.1 (0 2.1 (*NCHZ* . 1.)) j;NO ALLOWABLE STRUCTURES MOLECULES KU . 1.1 (C 3.1 (N . 1.1 (0 . 2.1) 1. . CjH7 0.N = 0, 2. CH. .CH3 CH3 0.N = 0, 3. i CH.O.CH3 CH.O.NHZ, 4. = CH.O.CH3 N.O.CH3, *Substructures defined in GOODLIST are pre- vented from reappearing except under the cor- responding superatom. Thus the final block of four molecules is the group containing none of the defined superatoms: *COOH*, *CO*, *CHNH2*, *CH20H*, *NOH*, *CHNH*, *NCHP*. In many applications, the count of a given super- atom will be set to zero for a particular con- text, or conversely, to non-zero. For example, the superatom o NCH2* is quite likely to be suppressed if the chemist knows that form- aldehyde was not used in the synthesis of the molecule being analysed. These computations are brought to the surface here only in order to reveal the heuristic revision of priorities that is available to DENDRAL. In actual problem solving, many of these hypotheses would be rejected long before a trial molecule was completed. REFERENCE TO DATA With these facilities we are now readytoattempt to apply DENDRAL to explicit data. The actual processes in the mass spectrometer are too complicated to be deaIt with head-on in the first instance. We therefore deal with various models of 210 Formal Representation of Human Judgment the behavior of the mass spectrometer, the theories of mass spectrometry. To exercise the simpler logical elements of heuristic DENDRAL, we begin with a zero order theory, one which postulates that the mass spectrum is obtained by assign- ing a uniform intensity to each fragment that can be secured by breaking just one bond in the molecule. We neglect the splitting of bonds affecting only a hydrogen atom. To test the program we do not at first use a real spectrum, but rather the spectrum predicted by this idealized theory for some given isomer. As before, the predicter is deeply embedded within the DENDRAL generator, so that the structure building tree is truncated at the earliest point that a violationof the theory by the data set is encountered. This leads to a very efficient set of trials, not of completed, but of tentative and partial structures when the program is given a molecular composition and a hypothetical zero-order spectrum. The essence of the program is to generate all of the partitions at a given level. and then to scan these for compatibility with the mass list of the fragments. There are also some pertinent apriori con- siderations about the partitioning of molecular compositions, and this has been used to reorder the primary partitions in the most plausible sequence. We manage the sequence with which hypotheses are tested but still retainthe exhaustive and irredundant character of the generator. Owing to imperfect memory and nonstandard formats, human judgment rarely succeeds so well at this. Each of the plausibility operations plainly should and can be related to a statement of context. For example, in setting up the GOODLIST, the chemist will be interrogated about the likelihood of certain radicals, and cues for this can also be obtained directly from the mass spectrum, For example, the program is aware that mass number 45 is almost pathognomic for the radical -COOH. Hence, this superatom will be set to zero in the absence of a signal at that mass. Conversely, in a high-resolution analysis, the occurrence of mass number 44.998 would justify fixing -COOH as nonzero. PERFORMANCE The description, so far, characterizes an operational Mechanization of Inductive Inference 211 program. Its main features can be routinely demonstrated without special preparation by remote teletypewriter inter- actions with the PDP-6 computer at Stanford University. DENDRAL has been tested in a number of ways in an attempt to evaluate its performance as a working tool. It will, of course, vastly outdo the human chemist in such contrived but potentially useful exercises as making an exhaustive and irredundant list of isomers of a given formula (Table 7.4 shows this for C,H,NOz). In many cases, particularly when an adequate dictionary has been previously built and no further entries are being made? the computer will output its solutions at teletype speed. The program is also slightly faster than the human operator at subgraph-matching, that is, searching a series of molecular structures for the presence of any member of a given list of forbidden embedded subgraphs. It will outdo the human by approximately 100 : 1, or perhaps better if ac- curacy is given due weight in converting structural representa- tions into canonical form and testing for isomorphism. A few real spectra have been input, with surprisingly crisp results in view of the known imperfections of the zero order theory of mass spectrometry. Thus heuristic DENDRAL was run with data on threonine obtained with a Bendix time-of-flight instrument (Fig. 7.2). The program returned two solutions, threonine, the correct structure, and one other (Fig. 7.3). The second isomer has not, to our knowledge, been analyzed by mass spectrometry. However, its spectrum can be predicted to resemble that of threonine very closely in its qualitative features. When Dendral was challenged with C,H,NO, under the conditions of Table 7.4 it returned 238 "plausible isomers," of which only these two satify the data according to the pro- gram's model of the theory of mass spectrometry. The in- clusion of the data shortens the computationtime from about 30 minutes to about 3 minutes. It is not easy to test the exhaustiveness of the DENDRAL generator without extensive files of known structures. How- ever, it is possible to write recursive combinatorial expres- sions to count the expected numbers of isomeric alkane mole- cules (C.H,,+.J and alkyl radicals (-CnHan+J as shown in Table 7.6. These numbers have been verified by DENDRAL for radicals through C9H,, and for molecules through &HZ8, after which the LISP program structure becomes too unwieldy to 90 80 70 1 60- 8,- 5 _ c $40 - .30 - 20 - ----_. 10 - 0 IL -- _1 0 20 -- ?"1 ---- d .---L--B---- Fig. 7.2 The mass spectrum of threonine (CH.. .NH2 CH. .CH3OH C. -OH 0) presented as a bar diagramfromMartin(1965). This yielded a list of mass numbers: (15 16 17 18 29 30 43 45 56 57 74 75 119). This list was input to DENDRAL. which responded to (ISOMERS C4H9N03) with two solutions: 1. CH. .NH2 CH. .CH3 OH C. = OH 0 threonine 2. c.... CH3 OH CH2.NH2 C. =OH 0 See Fig. 5.3. (4 (b) Fig. 7.3 (a) Threonine, (CH...NH2 CH..CHJ OH C. =OH 0). and (b) 2- Methyl, P-hydroxy, 3-aminopropionic acid. (C....CH3 OH CH2.NH2 C. =OH 0). 212 Mechanization of Inductive Inference 213 TABLE 7.6 Counting the Isomeric Alkanes and Alkyl Radicals: C.H2,+2and GHz~+~* c Alkane Alkyl C Alkane Alkyl 1 1 1 11 159 1238 2 1 1 12 355 3057 3 1 2 13 802 7639 4 2 4 14 1858 19241 5 3 8 15 4347 48865 6 5 17 16 10359 124906 7 9 39 17 24894 321198 8 18 89 18 605"3 I< 832019 9 35 211 19 148284 2156010 10 75 507 20 366319 5622109 *The figures were generated by a computer program folIowing the algorithm of Henze and Blair (1931). Cayley's historic algorithm is incorrect, buut is still quoted by a recent monograph on applications of graph theory (1965). continue in core memory. Since there are no chemical pro- hibitions, the list (Table 7.7) of 75 isomeric decanes may illus- trate the systematic combinatorial aspects of DENDRAL more vividly to a human reader than the preceding outputs do. Iso- mers are, of course, vastly more numerous for compositions containing some N and 0 atoms. Facilities have been provided in the past, but are not available on our present computer system owing to hardware limitations, for providing two-dimensional graphic displays of structural maps as translations of DENDRAL notations. These programs also enabled man-computer interactions where the chemist could manipulate chemical structures to a substantial degree. Where DENDRAL begins to be shaky is, as usual, when confronted with subtle changes of context which the user may often find difficult to communicate precisely to the program, even when he can do this readily to his fellow scientists. As far as possible we seek to get out of this difficulty by building interrogation subroutines into the program so that the chemis! can provide data rather than obliging him to write new program text in the LISP language. Present efforts are concentrated on elaborating the theory of mass spectrometry as represented in TABLE 7.7 The 75Isomers of Decane, CJL (ISOMERS ClOH22) MOLECULES ((U 1. C: 4. 5. 6. 2 9. 10. 11. :;: 14. ::: 17. 18. :`o: 21. 22. 23. 24. 25. 26. 27. 28. 29 . 30. 31. E: 34. 35. 36. 37. 2 40. 41. 42. 43. 44. 4465. 47: 4%. 49. 0.) (C . 10.)) . CH2 .CH2 .C3H7 CH2.CH2.C3H7, . CH2 .CH2 .C3H7 CH2 .CH2 .CH. .CH3 CH3, . CH2 .CH2 .C3Hf CHZ.CH. .CH3 C2H5, . CH2 .CH2 .C3H7 CH2.C...CH3 CH3 CH3, . CH2 .CH2 .C3H7 CH. .CH3 C3H7, . CH2.CH2.C3H7 CH. .CH3 CH, .CH3 CH3, . CH2 .CH2 .C3H7 CH. .C2H5 C2H5, . CH2 .CH2 .C3H7 C...CH3 CH3 C2H5, . CH2.CH2.CH. .CH3 CH3 CH2.CH2.CH. .CH3 CH3, . CH2 .CH2 .CH. .CH3 CH3 CHZ.CH ..CH3 C2H5, . CH2.CH2.CH..CH3 CH3 CH2.C.. .CH3 CH3 CH3, CH2.CH2.CH..CH3 CH3 CH. .CH3 C3H7, CH2.CH2.CH..CH3 CH3 CH. .CH3 CH. .CH3 CH3. CH2 .CH2 .CH 1 .CH3 CHi CH ..C2ii5 C2i5, ' CH2.CH2.CH..CH3 CH3 C. ..CH3 CH3 C2H5, CH2.CH. .CH3 C2H5 CH2 .CH. .CH3 C2H5, CH2 .CH. .CH3 C2H5 CH2.C.. .CH3 CH3 CH3, CH2.CH..CH3 C2H5 CH. .CH3 C3H7, CH2.CH..CH3 C2H5 CH. .CH3 CH. .CH3 CH3, CH2.CH. .CH3 C2H5 CH. .C2H5 C2H5, CH2.CH. .CH3 C2H5 C...CH3 CH3 C2H5, CH2.C...CH3 CH3 CH3 CH2.C. ..CH3 CH3 CH3, . CH2.C...CH3 CH3 CH3 CH. .CH3 C3H7, . CH2.C...CH3 CH3 CH3 CH..CH3 CH..CH3 CH3. . CH2.C...CH3 CH3 CHi ' CH. .C2H5 C2H5, . CH2.C...CH3 CH3 CH3 C. ..CH3 CH3 C2H5, . CH. .CH3 C3H7 CH. .CH3 C3H7, . CH. .CH3 C3H7 CH. .CH3 CH. .CH3 CH3, . CH..CH3 C3H7 CH. .C2H5 C2H5, . CH. .CH3 C3H7 C. ..CH3 CH3 C2H5. . CH..CH3 CH..CH3 CH3 CH. .CH3 Cti. .CH3 CH3, . CH..CH3 CH..CH3 CH3 CH, .C2H5 C2H5, . CH..CH3 CH..CH3 CH3 C. ..CH3 CH3 C2H5, . CH..C2H5 C2H5 CH. .C2H5 C2H5, . CH. .C2H5 C2H5 C. ..CH3 CH3 C2H5, . C...CH3 CH3 C2H5 C...CH3 CH3 C2H5. CH... E::: CH... CH... CH... CH... 2::: CH... :i::: CH... CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 C2H5 C2H5 C2H5 CH2 .C3H7 CH2.C3H7, ' CH2.C3H7 CH2.CH. .CH3 CH3, CH2 .C3H7 CH. .CH3 C2H5, CH2 .C3H7 C.. .CH3 CH3 CH3, CH2.W. .CH3 CH3 CHZ.CH. .CH3 CH3, CHZ.CH. .CH3 CH3 CH. .CH3 C2H5, CH2.CH..CH3 CH3 C.. .CH3 CH3 CH3, CH. .CH3 C2H5 CH. .CH3 C2H5, CH. .CH3 C2H5 C...CH3 CH3 CH3, C...CH3 CH3 CH3 C.. .CH3 CH3 CH3, C3H7 CH2 .C3H7, C3H7 CH2 .CH. .CH3 CH3, C3H7 CH. .CH3 C2H5, 214 Mechanization of Inductive Inference 215 TABLE 7.7 (Continued) 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 2: 75. CH... CH... CH... CH... CH... CH... CH . _ . CH... CH... C.... C2H5 C3H7 C. ..CH3 CH3 CH3, C2H5 CH. .CH3 CH3 CH2 .C3H7, C2H5 CH. .CH3 CH3 CHZ.CH. .CH3 CH3, C2H5 CH. .CH3 CH3 CH. .CH3 C2H5, C2H5 CH. .CH3 CH3 C. ..CH3 CH3 CH3, C3H7 C3H7 C3H7, C3H7 C3H7 CH. .CH3 CH3, C3H7 CH. .CH3 CH3 CH. .CH3 CH3, CH. .CH3 CH3 CH. .CH3 CH3 CH. .CH3 CH3, CH3 C.... CH3 CH3 ::::' CH3 C...: CH3 C.... CH3 C . . . . CH3 C . . . . CH3 C.... CH3 C . . . . CH3 CH3 k::: CH3 C . . . . CH3 C . . . . CH3 k::: CH3 C2H5 C . . . . C2H5 CH3 C2H5 CH3 CH3 CH3 C2H5 CH3 CH3 CH3 CH3 C2H5 C3H7 C2H5 CH2.C3H7, C3H7 CHZ.CH. .CH3 CH3, CH2.CH. .CH3 CH3, C3H7 C2H5 CH. .CH3 C2H5, CH..CH3 C2H5. C3H7 C.. .CH3 CH3 CH3, CH..CH3 CH3 CH2.C3H7, CH..CH3 CH3 CH2.CH. .CH3 CH3, CH..CH3 CH3 CH _ .CH3 C2H5, CH..CHS CH3 C...CH3 CH3 CH3, C2H5 CH2.C3H7, C2H5 C2H5 C...CHi CH3 dH3, C2H5 C3H7 C3H7, C2H5 C3H7 CH _ .CH3 CH3, C2H5 CH. .CH3 CH3 CH. .CH3 CH3, C2H5 C2H5 C3H7, C2H5 C2H5 CH. .CH3 CH3, the predictor subprogram. This is giving verypromising re- sults, the chief limitations being (1) the precise definition of the rules actually used by the chemist and operant in nature, and (2) the translation of these conceptual algorithms into viable program. These two issues are. however. not as in- dependent as might be imagined. It is the clumsiness of the program writing and debugging that impedes rapid testing of the correctness with which a rule has been formulated. In our experience each half hour of conference has generated approxi- mately a man-month of programming effort. It is obvious that despite the simplicity of the DENDRAL notation for chemical structures, we still have a long way to go in the development of a language for the simple expression of other conceptual constructs of organic chemistry, particularly context defini- tions and reaction mechanisms. Insofar as programs are also graphs and an effective sub- routine may be regarded as a hypothesis that matches its in- tended functions, the latter being both logically deducible and 216 Formal Representation of Human Judgment operationally testable by running the subroutine, program writing may be regarded as an inductive process roughly analogous to the induction of structural formulas as solutions to sets of chemical data. We believe it may be necessary to produce a solution to this meta language puzzle before the implementation of human ideas in computer subroutines can proceed efficiently enough for the rapid and effective transfer of human insights into machine judgment. Nevertheless. by the rather laborious process that we have outlined, DENDRAL has proceeded to that stage of sophistication where it is at least no longer an occasion of embarrassment to demonstrate it to our scientific colleagues and friends who have no interest whatsoever in computers per se. The deferral of cyclic structures will weaken the casuistic impact of the program upon chemists. However, the acyclic molecules give sufficient play for analyzing the inductive pro- cess. Furthermore, it may be advantageous to leave a blemish that diminishes the latent threat of artificial intelligence to human aspirations. However, a complete notation and specifi- cations for cyclic DENDRAL have been documented (Leder- berg, 1965) and this is being programmed now in response to the utilitarian demands of chemist friends. BUILDING DE NDRAL DENDRAL was developed in the LISP 1.5 and 1.6 dialects. The original package was composed by Mr. William White working from the' specifications summarized in Table 7,1, and a version of DENDRAL which almost worked was generated on the IBM 7090 with the helpof a time-shared editing system run on the PDP- 1. When the LISP system on System Develop- ment Corporation's Q-32 became available to us, we pursued a vigorous programming effort by remote teletype communica- tion from Stanford to Santa Monica. This proved to be a very powerful and remarkably reliable system, and Mr. White and Mrs. Georgia Sutherland perfected the program (Sutherland, 1967) on that computer with a total effort of about one year. In retrospect it is quite obvious that the program simply could never have been written and debugged without the help of the rapid interaction provided by the time-sharing system. We stress ylezx?l' advisedly, in the light of our own experience Mechanization of Inductive Inference 217 with the human frustrations involved in the typicalturnaround times for error detection and error correction under the operating system for the IBM 7090. In November 1966 we moved our operations to LISP 1.5 on the PDP-6 computer in- stalled for the Artificial Intelligence Project at Stanford. Despite the avowed close compatibility of the LISP systems, approximately 3 man-months of effort were required to trans- fer the program from one dialect to the other. PATTERN RECOGNITION As the candidate structures become more and more com- plex we have to abandon the idea of exhaustive enumeration of possible structures. Instead, the data are scrutinized for cues that offer any preference for certain kinds of structures as starting points. As we keep examining the problem we do find more and more ways in which such cues can be exploited. For example, an elementary pattern analysis of the period with which mass numbers are represented, for example, for gaps in the sequence of mass numbers with significant intensity around a period of about 14 mass units (CH,), can give signifi- cant hints about the existence of a number of branch points within the molecule. If these can be limited, the extent of the necessary tree building can be drastically curtailed from first principles. Likewise, an examination of mass numbers ap- proximating half the total molecular weight can lead to some trial hypotheses about the major partition of the molecule, which again can truncate the development. We do not, how- ever, yet have a program sophisticated enough to make a pro- found reexamination of its own strategy at any level more complicated than the resetting of numerical parameters, a limitation closely related to the meta language challenge mentioned above. In sum, we find that the development of this program has not encountered very much that is fundamentally new in principle: problem solving in this field has much the same flavor as the solutions already adduced for chess, checkers, theorem proving, etc. One possible advantage of pursuing investigations in artificial intelligence and heuristic programming within this framework is that the practical utility of what has already been produced should suffice to en- gage the attention of a considerable number of human chemists 218 Formal Representation of Human Judgment working on practical problems in a fashion that lends itself to machine observation and emulation of their techniques. PROGRAMMING AS INDUCTION The game of writing programs becomes more and more an experimental science as the complexity of the programs in- creases. At the limit, the programmer has the insecure hope that his text will (1) run and (2) accomplish the intended goals, that is, his program is a hypothesis that needs deductive elaboration to verify it. This suggests that program writing ought to be mechanized by a process analogous to the induction of chemical hypotheses by DENDRAL and starting with mechanized observations of human techniques of problem- solving, The pervasive role of analogy in human judgment suggests that much would be gained in artificial intelligence if a large compatible tool kit of successful programs were available both to the human and the mechanized programmer. Unfortunately, artificial intelligence researchers go to such excesses in their originality and improvisation of idiosyncratic dialects, that there is no easy way in which past successes of unpredictable relevance can be immediately tried out for a new problem. Experimental science, on the other hand, is replete with impor- tant advances that resulted from the provocative availability of a new technique waiting on the shelf to find a use. Indeed, mass spectrometry itself has exactly that history.