1977-78 Annual Report RR-00612 Section 3.1

Notes to Figure 9: MSV1 is the maximally specific rule version. MGV1 and MGV2 are maximally general rule versions. Only the rule patterns (left hand sides) are shown above. All rules shown predict the same action: the appearance of a peak associated with atom 9" in the range 14.0 to 14.7 ppm downfield from TMS.

The version space represented in Figure 9 above contains several hundred rule versions: the three versions shown plus all versions between these in the general-to-specific ordering. However, it can be represented simply by the two maximally general versions, MGV1 and MGV2, and the single maximally specific version, MSV1. The single most specific version contains every node and node attribute constraint consistent with all positive training instances. In this program the classes of positive and negative training instances are sets of molecules for which the indicated spectral peak does and does not appear. Thus, any rule version more specific than MSV1 cannot match every positive instance. Two general versions are required in this case since neither is "above" the other in the general-to-specific partial ordering. Any rule more general than either MGV1 or MGV2 will match some negative instance. Furthermore, any rule which is between these general and specific boundaries of the version space will match all current positive instances (by virtue of being more general than MSV1), and will match no current negative instances (by virtue of being more specific than MGV1 or MGV2).

3.1.3.3 Version Spaces and Rule Learning

Rather than select a single best rule version, the candidate elimination algorithm represents the space of all plausible rule versions, eliminating from consideration only those versions found to conflict with observed training instances.
Thus, the candidate elimination approach separates the deductive step of determining which rule versions are plausible from the inductive step of selecting a current-best-hypothesis. The algorithm is assured of finding all correct versions of the rule after all training data has been presented, without the need to backtrack to reconsider previous training data or decisions.

In this example, RULEGEN was used to generate a set of plausible rules characterizing the CMR spectra of a set of training molecules. For each rule, the associated evidence was given to the candidate elimination routine, which formed the version space for this evidence set. Subsequent data may be analyzed to modify the version space in a manner guaranteed to be consistent with the original data.

The candidate elimination algorithm operates on the maximally general and maximally specific sets representing the version space. The set of maximally general rule versions (MGV) is initialized to a single rule consisting of the most general possible rule subgraph (a single atom graph with no constrained node attributes), and the predicted shift range determined by RULEGEN. The set of maximally specific versions (MSV) is initialized to a rule which contains as its subgraph the entire molecule associated with the first observed positive instance. The initial version space represented by these extremal sets therefore contains all rules which match the first positive training instance (the most general possible rule, the very specific rule, and all intermediate rules).

The training instances are then considered one at a time. Each training instance is used to eliminate from the version space those rule versions which conflict with that instance. This is always accomplished by shifting the maximally specific and maximally general boundaries of the version space toward each other as shown in Figure 10.
[Figure 10 diagram. Labels: more specific, more general, Most Specific Versions, positive instances, negative instances, Most General Versions.]

Figure 10. Effect of Positive and Negative Training Instances on Version Space Boundaries

Positive training instances force elements of MSV to become more general, whereas negative training instances force elements of MGV to become more specific. The maximally specific set can, of course, never be replaced by a more specific set (nor the maximally general set by a more general one) since, by definition, any version outside the current version space boundaries is inconsistent with previous training data. The action taken by the candidate elimination algorithm in updating the extremal sets is given below.

For negative training instances, each element of MGV which matches the instance must be replaced by a set of minimally more specific versions which do not match the instance. These new versions are obtained by adding constraints taken from elements in MSV in order to ensure that they remain more general than some element of MSV, and thus remain consistent with previous positive instances. Furthermore, each element of MSV which matches the negative training instance must be eliminated from the set (since it is already maximally specific, it cannot be replaced by a more specific version).

For positive training instances, any elements from MSV which do not match the new instance are replaced by a set of minimally more general elements which do match the instance. In order to ensure that these more general versions do not match past negative training instances, any which are not more specific than at least one element of MGV are eliminated. Elements from MGV which do not match the positive instance are eliminated.
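The update rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Meta-DENDRAL implementation: rules are simplified to fixed-length attribute tuples rather than molecular subgraphs, None marks an unconstrained attribute, and the names matches, more_general_or_equal and candidate_elimination are hypothetical names introduced here.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: a rule is a tuple of attribute constraints; None = unconstrained.
Rule = Tuple[Optional[object], ...]
Example = Tuple[Sequence[object], bool]  # (instance, is_positive)


def matches(rule: Rule, instance: Sequence[object]) -> bool:
    """A rule matches when every constrained attribute equals the instance's."""
    return all(r is None or r == v for r, v in zip(rule, instance))


def more_general_or_equal(a: Rule, b: Rule) -> bool:
    """True if rule a matches everything that rule b matches."""
    return all(x is None or x == y for x, y in zip(a, b))


def candidate_elimination(examples: List[Example]) -> Tuple[List[Rule], List[Rule]]:
    """Return (MGV, MSV) after processing all examples in order."""
    first, is_positive = examples[0]
    if not is_positive:
        raise ValueError("the first training instance must be positive")
    n = len(first)
    mgv: List[Rule] = [tuple([None] * n)]  # most general possible rule
    msv: List[Rule] = [tuple(first)]       # the first positive instance itself

    for instance, positive in examples[1:]:
        if positive:
            # Drop general versions that fail to match the positive instance.
            mgv = [g for g in mgv if matches(g, instance)]
            new_msv: List[Rule] = []
            for s in msv:
                if not matches(s, instance):
                    # Minimal generalization: relax each conflicting attribute.
                    s = tuple(sv if sv == xv else None
                              for sv, xv in zip(s, instance))
                # Keep only versions still bounded above by some element of MGV.
                if any(more_general_or_equal(g, s) for g in mgv) and s not in new_msv:
                    new_msv.append(s)
            msv = new_msv
        else:
            # Specific versions that match a negative instance are eliminated.
            msv = [s for s in msv if not matches(s, instance)]
            new_mgv: List[Rule] = []
            for g in mgv:
                if not matches(g, instance):
                    candidates = [g]
                else:
                    # Minimal specializations: add one constraint taken from an
                    # element of MSV that excludes the negative instance.
                    candidates = [
                        g[:i] + (s[i],) + g[i + 1:]
                        for s in msv
                        for i in range(n)
                        if g[i] is None and s[i] is not None and s[i] != instance[i]
                    ]
                for c in candidates:
                    if c not in new_mgv:
                        new_mgv.append(c)
            # Keep only the maximally general members.
            mgv = [g for g in new_mgv
                   if not any(h != g and more_general_or_equal(h, g)
                              for h in new_mgv)]
    return mgv, msv
```

As a usage illustration with made-up attributes (atom type, neighbor count, environment), calling candidate_elimination on the positive instance ("C", 2, "ring"), the negative ("C", 3, "chain"), the positive ("C", 2, "chain") and the negative ("O", 2, "chain") leaves both boundary sets equal to the single rule ("C", 2, None). After only the first two instances the general boundary holds two incomparable versions, (None, 2, None) and (None, None, "ring"), which mirrors the need for both MGV1 and MGV2 in Figure 9.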
After processing each training instance, the new maximally general and maximally specific sets will bound the space of all rules consistent with the observed data.

3.1.4 Current Status and Future Work

The incremental learning ability for Meta-DENDRAL depicted above in Figure 8 is almost fully implemented, but as yet remains untested. Routines for defining and modifying rule version spaces are implemented, as well as the ability to filter out training data explained by a rule set. The major unimplemented portion of the incremental learning scheme is the process for merging new rules into the evolving rule set. The chief issue here is deciding when and how to choose among or merge new rules which are similar to existing rules. We expect to complete implementation and initial testing of the incremental learning ability during 1978.

Among issues associated with the version space approach which we expect to explore during the current grant period are the following:
1) Intelligent selection of new training data from examination of partial results.
2) Applying chemical plausibility information to select a "best" rule version from among those contained in the version space.
3) The extension of current methods for dealing more completely with noisy and ambiguous training data.
4) The use of version spaces for merging similar rules.

3.2 New Capability to Emphasize Discriminatory Power

One important intended use of rules formed by Meta-DENDRAL is the prediction of mass spectra for use in structure elucidation: predicted spectra for a set of candidate structures are compared by computer with the mass spectrum observed for an unknown compound, and on this basis the candidates are ranked according to the likelihood of their identity with the unknown. The ability of rules, in this context, to differentiate correctly among candidate hypotheses is called their "discriminatory power."
Since the selection criteria previously used by Meta-DENDRAL during the various stages of rule formation did not necessarily correlate with high discriminatory power, it was decided to provide the program with the option of directly emphasizing discriminatory power during rule formation, in order to maximize the usefulness of the resulting rules for purposes of structure elucidation. This addition to Meta-DENDRAL has now been designed and implemented.

The general method employed by the new option is as follows. Observed mass spectra of the training molecules are analyzed prior to rule generation to determine how diagnostic the various observed peaks are, within the training set, of the molecules that show them. This information is then used during rule formation to compute a measure of discriminatory power for emerging rules. This measure is used, in combination with other criteria, to guide the search during rule generation, and to control the modification and selection of rules during the later phases of processing.

Preliminary testing of this new rule formation scheme on the monoketoandrostanes produced rules of considerably greater discriminatory power within that family than had been produced in earlier work with Meta-DENDRAL, even though the training set used was only half as large as that used earlier.

This "discrimination option", now integrated with the new template-processing capability, is currently being further tested on a group of aromatic esters to determine whether the rules formed are consistent with what is known about the fragmentation modes of those molecules, and whether the rules have significant discriminatory power outside the training set used to form them.

3.3 Improved Ranking Capability

The program used within the Meta-DENDRAL framework to rank candidate structures has been improved in several ways.
A) The program now summarizes its own results and prints the summaries, thus eliminating much tedious manual analysis that previously was necessary. This makes possible a much more systematic and extensive investigation of scoring functions and their behavior than was previously possible.
B) A large number of new scoring functions have been made available, many of them specially designed for use with rules formed under the "discrimination option."
C) A new ranking method has been implemented as an option, with an eye toward improving the application of scoring functions in ranking. This new method eliminates duplicate explanations of peaks (which were previously permitted) in a principled way. The new method may be easier to justify theoretically, and yielded generally better ranking results than did the old method in tests performed with monoketoandrostanes. Further tests are planned with aromatic esters and marine sterols.

3.4 Data Selection Program

It is a commonplace of methodology that good inductive generalizations depend on variety in the data set. This is no less true in the context of rule formation by Meta-DENDRAL. Whether the goal is to discover rules of high generality or high discriminatory power, one's chances of achieving this goal [appear to] increase with increasing variety of training instances. This suggests that it would be useful to have a data selection program that would select the subset of the potential training molecules which has the greatest variety, in some appropriate and well-defined sense. A preliminary version of such a program has been implemented, and experiments with it will soon be underway. The method employed has two steps:
A) Construction of an index of all the structurally different possible fragmentation environments permitted in the molecules of the set of potential training molecules (PT) by the "half order theory" of mass spectral fragmentation.
B) Construction of an n-sized subset of PT that contains nearly the largest number of different permitted fragmentation environments possible for a set of that size.

3.5 Feedback Loops

3.5.1 Filtering with Respect to Existing Rules

The RULEGEN program is capable of accepting previously defined rules as a means of filtering the evidence obtained from INTSUM before the evidence is used for rule formation. As well as providing a convenient and natural feedback mechanism for the program, this facility also allows rules obtained from other sources to be used to reduce the space which the program must examine to find rules for a given set of data. In this manner, the program is able to focus attention on evidence which is not already explained by any of the rules which it is given.

A problem with this approach arises from the fact that the spectral evidence may often be the result of more than one fragmentation, yet the filtering mechanism assumes that any evidence which supports a rule is completely accounted for by that rule. Tests are in progress to determine the limitations of this approach.

3.6 Program Improvements

3.6.1 Defining Rules with EDITSTRUC

In addition to the programs which produce rules from the spectral data, other programs have been developed to allow a user to define a set of rules manually. Like the rules produced by RULEGEN and RULEMOD, these are rules of structure fragmentation which are expressed in terms of molecular subgraph descriptions. The programs for manual rule definition provide a simple yet useful language for the description of these rules. A principal part of this language is the EDITSTRUC language, developed for CONGEN. This allows us to take advantage of the advanced structure manipulation capabilities which are a part of the EDITSTRUC package. The ability to create rules manually should be particularly useful in conjunction with the rule filtering mechanism of RULEGEN mentioned previously.
This provides the chemist with a natural means of describing obvious rules which the program can eliminate from consideration before focusing on the remaining unexplained evidence.

3.6.2 Stability Rules in INTSUM and RULEGEN

The programs have been generalized to allow the analysis of the mass spectral data from the point of view of determining rules about stable bonds, i.e., lack of fragmentation in a molecule as well as fragmentation. Just as peaks are evidence of fragmentation in a structure, absence of peaks is evidence that certain fragmentations have not occurred. The programs are now capable of examining the original data from either point of view and proposing rules of behavior of the molecules from that point of view. Further work remains to be done to carry this generality through the processing performed in RULEMOD, and then to conduct experiments to determine the usefulness of stability analysis.

3.6.3 Expanded Template Space

Originally, the subgraph descriptions in the rules produced by the RULEGEN program were restricted by requiring that the internal connection patterns of the subgraphs be completely specified. In other words, for each of the interior nodes in the subgraph, the complete set of neighbors had to be specified. This restriction excluded rule forms which seemed to be both plausible and desirable, so the program was changed to eliminate the restriction.

In terms of the mechanism used by the program to search the space, implementation of this change meant removing the restriction on the subgraph matching templates that the neighbors property be required at all but the outer levels of a template. This allows the program to find rules in which the internal connection patterns of the chemical subgraphs are only partially specified. For example, it is now possible to express a rule such as "break any bond which is 2 bonds away from an oxygen atom".
Such a rule could not be expressed previously without identifying whether the nodes between the oxygen atom and the break were secondary, tertiary, or quaternary.

3.6.4 Small LISP and Program Efficiency

Increased size and complexity of the Meta-DENDRAL software has resulted in increasing efforts aimed at making the programs more efficient and understandable. All the programs which are part of the Meta-DENDRAL system are now capable of running in the environment of "small LISP". This makes considerably more memory space available to the chemist for the data structures, thus making possible the solution of significantly larger problems than were possible in the standard LISP environment.

3.6.5 Help Facilities

As the programs have increased in complexity and usefulness, we have had to face problems of documentation and explanation of the programs to their users. Text explanations of the various aspects of the programs must be provided, and kept up to date, to allow others to use the system. It is also important that the text descriptions of the programs be available to the programs themselves, to be used during program execution to provide on-line guidance to the user concerning the use of the programs.

Text descriptions of the programs must be closely associated with the programs themselves to ensure that program changes are reflected accurately in changes in the text which describes them. Yet text explanations must be incorporated into the programs so as not to take up space which should be available during program execution for producing results. An attempt has been made to resolve these sometimes conflicting goals through the use of the comment facilities of LISP, and through the generation of programs and conventions for programming which allow program documentation and explanations to be incorporated into the programs as comments in the appropriate places.
There are then programs which have access to this information to produce documents and on-line explanations about the programs.

4 COLLABORATIVE RESEARCH

4.1 CONGEN Users

Dr. Peter Gund of Merck, Sharp and Dohme Laboratories contacted us for a current CONGEN manual and Guest login information. He now feels that he has analytical problems which would lend themselves well to checking with CONGEN.

Professor Richard E. Moore of the University of Hawaii visited Stanford and was provided with a CONGEN demonstration on a problem relating to his own marine sterol work. We discussed system access and Tymnet node availability with him. He plans to return in the near future with another problem, and then consider the possibility of requesting access.

Dr. Jean-Claude Braekman of the University of Brussels travels across Brussels to use a terminal at the offices of the Belgium Chemical Society, in order to access CONGEN on SUMEX. Dr. Braekman uses the mail facilities to remain in contact with Prof. Djerassi's research group.

Dr. Martin Huber, a postdoctoral fellow in Professor Wipke's SECS group, has been starting work in an area related to the graph theoretic basis for CONGEN. In an effort to encourage cross-fertilization of ideas, we encouraged and arranged a meeting between him and several of the DENDRAL project members. The resulting discussion, at the least, provided Dr. Huber with suggestions and information for further study. Likewise, DENDRAL was able to obtain a better idea of similarities in research interests between the two groups. We are currently pursuing several problems in graph theory concerning analysis of molecular structures. These problems arose directly from this meeting and concurrent discussions with Prof. Wipke.

During the special symposium at the San Francisco ACS meeting in the fall of 1976 which Ms.
Suzanne Johnson helped to organize and chair, members of the DENDRAL group provided on-line demonstrations of CONGEN during the "hands-on" session. At this time Professor Kurt Mislow of Princeton University expressed interest in using the program. Later, we provided him with Guest access information and answers to his questions concerning terminals and other useful programs available to chemists on various commercial networks. As a result of this effort, Professor Mislow has used CONGEN and has been considering its use as a teaching aid. He wrote us this past spring to enquire whether Guest access to CONGEN might be possible for his friend Professor Weiss, head of the Department of Chemistry at Northeastern University. We subsequently provided Professor Weiss with the information necessary to access CONGEN on a trial basis.

In November 1976, Dr. Stan Lang of Lederle Labs' Infectious Disease Research Section requested access to CONGEN. After being provided with the appropriate information and initial help, he encouraged Dr. Leon Goldman to request access also, and to request information on obtaining a copy of the teletype DRAW program used to draw CONGEN structures on teletypes. A recent phone conversation with Dr. Babu Venkataraghavan, a new member of the research group at Lederle, indicated that the TTY DRAW program was being used quite successfully. Also interested in the possibility of support for graphics terminals, Dr. Venkataraghavan called to discuss the problem in terms of Omnigraph, which they already have on their PDP-10. We have exported a complete copy of all the DRAW program files, including sample data files, to Dr. Venkataraghavan and are currently in contact with him on implementation questions.

A further example of cooperation between DENDRAL and Professor Wipke's group concerns the sharing of graphics programs.
DENDRAL obtained the Fortran sources for programs created by the SECS group to do molecular modelling and structure display on the DEC GT40. Wanting to interface these programs to CONGEN, but not wanting to limit CONGEN graphics to one terminal type, DENDRAL personnel modified the program to use the Omnigraph graphics package available on SUMEX. Glenn Ouchi of the SECS project has become familiar with the relationship of the graphics in CONGEN to the modeller's graphics. SECS has become aware of the desirability of supporting additional terminal types for graphics output, and will be investigating Omnigraph applications to this area.

One of the students who used CONGEN in Prof. Djerassi's molecular structure elucidation course introduced the program to a graduate student of Professor E.J. Eisenbraun's (Oklahoma State University). Professor Eisenbraun is a well known marine natural products chemist. He has requested Guest access information, and appropriate materials were provided in spring of 1977. Professor Eisenbraun subsequently visited Stanford and got a personal demonstration of CONGEN.

We have been in contact with Dr. Karl Kuhlman, a chemist and PROPHET user at SRI International. We have arranged for a group of DENDRAL chemists to get together with the SRI group for exchange demonstrations: CONGEN for PROPHET, and discussion of similar problem areas with visiting PROPHET representatives.

Dr. David Pensak of DuPont in Wilmington, Delaware originally started out as a CONGEN Guest user. In return, he contributed a good deal of knowledge concerning evaluation and use of molecular modelling programs. At the current time he is beginning to build a research group in computer applications in chemistry, and views SUMEX/DENDRAL somewhat as a resource from which to obtain knowledge of hardware, software and people.

Dr.
Milton Levenberg of Abbott Laboratories first expressed interest in CONGEN at an ACS meeting two years ago. He was given an account and appropriate information at that time. He had used OMNIGRAPH to develop a program to display and plot mass spectra, which he gladly provided to us. That program now provides a means for chemists to obtain a plot of their spectra which have been obtained on mass spectrometers which are not yet equipped with automatic computer output.

When Kent Morrill was a graduate student in chemistry he developed an interest in CONGEN and various of the Meta-DENDRAL programs. When he left recently for a job with Tennessee Eastman, he requested Tymnet login information to take with him. As a result of his interest, Dr. Gary Santee of Eastman Kodak in Rochester requested information for Guest access to CONGEN. Kodak may also be in the process of forming a computer applications in chemistry group, and once again, we seem to be viewed as a potential information resource in this type of effort.

Dr. Gretchen Schwenzer was a postdoctoral fellow with DENDRAL. When she left Stanford for a job at Monsanto, it was with the idea of taking part in helping to develop a computer applications in chemistry group. She too views SUMEX as an information and know-how resource. To that end, we have had several phone calls and terminal links from her concerning graphics, terminals, modelling programs and text editors. She is interested in obtaining several copies of documentation preparation programs either developed or supported at SUMEX.

Dr. Robert Shapiro of New York University came to visit Stanford in September of 1977 to learn to use CONGEN. He spent a week in residence to discuss structure elucidation problems relating to nucleic acids and their interactions with other substances.
We are also pursuing ideas on the automated analysis of UV spectra of such compounds, based on empirical rules derived from study of known systems.

In November of 1976, Dr. Henry Stoklosa of Ciba-Geigy approached one of the members of the DENDRAL project for trial use of INTSUM. During a subsequent visit to Stanford, we introduced him to CONGEN and its use. We have been keeping him up to date on recent developments because he indicated that CONGEN is beginning to be of more and more use to him in the analytical task of evaluating additive bonding in polymeric materials.

Dr. Geza Szonyi of Polaroid Corporation was one of the first persons to enquire about SUMEX/CONGEN access as a result of the "invitations for use" which were included as a part of early journal articles. He has recently requested trial access to CONGEN. Phone conversations indicate that his group is evaluating computer systems which will offer them the greatest latitude in applying computers to their work in various fields of chemistry and related data management. Once again, DENDRAL is viewed as a potential knowledge source.

Drs. D. Williams and R. McGrew from the Midland, Michigan site of Dow Chemical came to visit Stanford and receive an introduction to CONGEN. They were given a CONGEN demonstration, and as a result, requested a copy of the teletype DRAW portion of the program, which we sent to them. This brings to five the number of sites which are now using the teletype DRAW program in some fashion. Also included are: Lederle Labs in New York (Dr. Babu Venkataraghavan); Dept. of Computer Science at SUNY (Dr. Dave Larson); Dept. of Chemistry, Arizona State Univ. (Prof. Morton Munk); Dept. of Chemistry, Miyagi Institute (Prof. Hidetsugu Abe); and Cambridge University (Neil Gray).
4.2 Marine Natural Products

4.2.1 Mass Spectral File Search System

An attempt was made to obtain mass spectra for all marine sterols reported in the literature (Appendix A). The old mass spectral files were scanned and pertinent sterol mass spectra were digitized (a file of non-marine sterol mass spectra was also acquired from the older files as a supplement to the marine file) (see Appendix B). Marine sterol researchers were requested to send samples of specific sterols which they reported or sterol mixtures known to contain the requested sterol (see Appendix B). In a few cases sterols were isolated from crude extracts of organisms known to contain the sterols.

The high resolution GC-MS spectra of the available sterols were recorded using a Hewlett Packard 7610A gas chromatograph equipped with a 10' x 2 mm "U" shaped column (3 per cent Poly S-179 on Gas Chrom Q or 3 per cent OV-17 on Gas Chrom Q; column temp. 260 degrees C) and interfaced with a Varian MAT 711 double focussing mass spectrometer (equipped with a Watson-Biemann dual stage separator, an all glass inlet system and a PDP-11/145 computer for data acquisition). High resolution spectra were recorded for subsequent fragmentation analysis by the application of data interpretation and summary programs, e.g., INTSUM, and to facilitate handling of the data for construction of the searchable files. Within the framework of the available data acquisition and reduction systems, the rapid analysis scheme has been tested, and the advantages and limitations are the subject of the following section.

The spectra of 52 marine sterols were compiled in a computer searchable format. The spectra, which are essential to have available for careful comparison following the search report, have been plotted, and the plausible or established interpretations of the higher molecular m/e peaks have been indicated on the spectra. Spectral interpretations have been coded in Fig.
8 in a series of 32 symbols which have been appropriately marked on the spectra of each sterol in Appendix C tiich is the file of marine sterol spectra constructed in our laboratory. Attached is a list of investigators who reported and received copies of this file. This summary of proposed fragmentation rules is acting as a preliminary guide in the INTSUM evaluation. The S-H program was used to match every spectrum in the file (Appendix C) to every other spectrum to gain an indication of how all the spectra rank to one another in terms of the similarity index described previously (Table V). A rank of 999 indicates a psitive identification: therefore, each spectrum when compared against itself results in a rank of 999. Ranking values below 500 indicate positive nonidentity and are not recorded. Ranking values approaching 750 indicate a possible match is not ranking higher due to variations in spectrometer operating conditions. Table V displays a numbar of interesting results. First, several separate sterols rank at the identity 87 1977-78 Annual Report RR-00612 Section 4.2 rank, that is, they have mass spectra which are similar enough to be basically indistinguishable: Sterols 15 and 18 Appendix A: this indicates that mass spectrometry cannot distinguish between slightly different side chain alkylation patterns in some cases. This agrees with the similar evaluations in the literature. Sterols 68 and 71: this indicates that mass spectrometry cannot distinguish between side chain double bond geometrical isomers (E and Z) in this case. Sterols 90 and 80: these are again sterols with slightly different patterns of side chain alkylation. See pp. 88a-c for Table V. 88 Table V. LIBRARY SEARCH REPORT FOR EXPERIp!NT SEARCHING 52 SPECTRA IN 1MARINE AGAINST THEMSELVES STEROL -13 10 14 -ii- 16 17 18 23 25 26 27 999 99 599 547 554 999 G-- 999 _.. . 399 508 ?99 299 376 555 504 999 999 888 612 510 099 999 999 745 594 547 999 631 642 581 999 647 990 593 581 5e0 399 629 Table V. 
(cont. 1 Qab STEROL STEROLS MATCHED 44 -ii 60 61 -ii 68 999 114 Cl 0 398 (24S)-24-NETHYLC~OLEST~~5,2S~O~E~~-3B~TA- 660 79 999 ij 398 24-~~THYLC~OLEST~~5,22e-DTENIJHETA-oL 635 75 999 $d 398 516 64 ~ALP~AI~~~~ETHYLCYOLESTAI~,~~EIJ~ETA-U~ 0 0 412 ~2~S)-24-ET~iYtC~OLESrA-5,25-DIE~J-3~~TA.O - 999 ?nG 999 0 398 58 24-f~FTHYLC#oLESTA-5,2~~2~)-~I~h-3BETA-~L 547 999 R 398 521 61 3~-~ET~iYtCHOLESlA-5,22~~O~E~~3~ETA~OL 990 U 426 GORGCST-5-EN-3nEtA-UL 527 39 999 0 426 r2dZ)-2daPROPYLJnENEt~O~-E~TA-5-F~~3~ETA- 9991132 999 .- 999 86 999 1.j 481; 5AI PtlAc24-~ETWYiCHQLEST-7-EN-3~ElA-~L 666 55 999 c" 414 5ALPrA.24-FTwYLCwotEsT-7-fN-39Eth-oL 529 45 999 C'I 386 SALPrn~C~OLES7-7-E~~~3RET,~-O~. 994 Sif 999 (: 612 S~LPYA~24-~7~YLC~OLES;a-7r22~-OI~N-3~ET~ 524 62 0 k? 412 230NOR-GGRGOST-S-EN-30ETA-OL. 51 59 999 ~1 413 ~~-~~T~YLC~~ILESTA-S,~?E-O~E~~~~~ET~-~L 52 44 99s u 412 23,24-OI~ETtiYLC~~LESTA~5,22~@~~l~-3RETb-~ I - Table V. (cont.) STEROL STEROLS MATCHED t- 75 77 1 78 79 82 t- 86 88 I-- 96 l- 1977-78 Annual Report RR-00612 Section 4.2 An important limitation of the file search system is then its inability to distinguish between variations in side chain alkylation. These various side chain alkylation patterns are very important with respect to biosynthetic processs. Since these sterols have different retention indices, this limitation has been overcome by searching a file of retention indices as well as mass spectra. A computer program for accurately calculating retention indices has been developed by William Yeager, Department of Genetics, Stanford University, and is applicable to the rapid analysis sequence. Michael Kohraman has prepared a file of carefully measured retention indices from samples used to compile the mass spectral file; (Table VI) therefore, the limitations concerning identification of isomeric side chain alkylation patterns have been reduced. See p. 8ga for Table VI. 89 TABLE Yf RETENTION INDICES OF STEROLS OF SP2250 , 3047 4 1 3282 4 3268 10 3317 1 I I " -. 
Second, some sterols have very distinctive mass spectra with respect to the other spectra in the file, and no other spectrum ranks above 500 (this holds for 17 spectra); however, the majority of spectra do show some similarities to other spectra in the file, i.e., have a cross rank > 500 with another sterol mass spectrum in the file. It is interesting that sterols which are saturated match only with other saturated sterols, sterols with one nuclear unsaturation match only with other sterols with one nuclear unsaturation, sterols with 2 nuclear unsaturations match only sterols with 2 nuclear unsaturations, and sterols with one nuclear and one side chain unsaturation (or ring junction) match only sterols possessing that property. The empirical ranking algorithm described previously has thus detected the number and general positions of unsaturation in the sterols. Therefore, if a new sterol is detected by the file search procedures, the general structural properties of the new sterol (number of nuclear and side chain double bonds) may be indicated by the structures of the sterols with which it is ranked, even though the ranking values are very low.

The real utility of the search system will be in rapidly sorting a tremendous quantity of experimental data in an effort to reveal the sterols of novel structure. This is of tremendous utility because marine sterol mixtures are generally complex, containing over 40 sterols in some cases. However, once the sterol of novel structure is pointed out, a careful analysis of the mass spectral fragmentation in terms of known processes must proceed.
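The rank conventions used in Table V (999 for identity, values below 500 not recorded) can be illustrated with a minimal Python sketch. The sterol labels, rank values, and function names below are hypothetical, chosen only for illustration; the similarity index itself is not reproduced, and this is not the search program used in this work.

```python
IDENTITY_RANK = 999      # rank reported when two spectra are indistinguishable
RECORD_THRESHOLD = 500   # ranks below this value are not recorded


def recorded_matches(ranks):
    """Keep cross ranks at or above the recording threshold.

    `ranks` maps (sterol_a, sterol_b) -> rank on the 0-999 scale.
    Self comparisons are dropped because they always rank 999.
    """
    return {
        pair: rank
        for pair, rank in ranks.items()
        if pair[0] != pair[1] and rank >= RECORD_THRESHOLD
    }


def indistinguishable_pairs(ranks):
    """Pairs of different sterols whose spectra rank at the identity value."""
    return sorted(
        {
            tuple(sorted(pair))
            for pair, rank in recorded_matches(ranks).items()
            if rank == IDENTITY_RANK
        }
    )


def distinctive_sterols(ranks):
    """Sterols with no recorded match to any other sterol in the file."""
    all_sterols = {sterol for pair in ranks for sterol in pair}
    matched = {sterol for pair in recorded_matches(ranks) for sterol in pair}
    return sorted(all_sterols - matched)


# Hypothetical rank values, for illustration only.
example = {
    ("A", "A"): 999, ("B", "B"): 999, ("C", "C"): 999, ("D", "D"): 999,
    ("A", "B"): 999, ("B", "A"): 999,
    ("A", "C"): 612, ("C", "A"): 612,
    ("A", "D"): 340, ("B", "D"): 120, ("C", "D"): 0,
}

print(indistinguishable_pairs(example))
print(distinctive_sterols(example))
```

In this made-up example, A and B play the role of a pair such as sterols 15 and 18, while D plays the role of a sterol with a distinctive spectrum that no other spectrum matches above 500.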
Rules generated via INTSUM, etc., analyses of the extensive marine sterol high resolution mass spectral files will help greatly by providing firm guidelines for the structural evaluations of the previously unencountered sterols.

4.2.2 Researchers Receiving Marine Sterol Data

Dr. J. B. Heather, The Upjohn Company, Chemical Process, Rsch & Development, Kalamazoo, Mich.
Dr. Steven C. Welch, Dept of Chemistry, University of Houston, Houston, Texas 77004
Dr. Richard M. Wing, Univ of California, Riverside, Ca.
Prof. Paul J. Scheuer, University of Hawaii, 2545 The Mall, Dept of Chemistry, Honolulu, Hawaii
Dr. Yuzura Shimizu, Univ of Rhode Island, College of Pharmacy, 53 Fogarty, Kingston, R.I.
Dr. Maktoob Alam, University of Houston, College of Pharmacy, Dept. of Med. Chem. and Pharmacognosy, Houston, Texas 77004
Dr. Ron Quinn, Roche Research Inst., P. O. Box 255, Dee Why NSW 2099, AUSTRALIA
Dr. K. Ivanetich, Dept Physiol. & Med. Biochemistry, Medical School, Observatory, Cape, SOUTH AFRICA

5 Carbon-13 Work

The work described in this section was accomplished in conjunction with work on structure elucidation and theory formation programs (sections 2 and 4). It is presented together here to make a more coherent presentation.

Carbon-13 nuclear magnetic resonance (CMR) has developed into an important tool for the structural chemist. A natural abundance CMR spectrum which is fully proton decoupled consists of a number of sharp peaks which correspond to the resonance frequencies, in an applied magnetic field, of the various types of carbon atoms present. A 13C shift is the amount an observed peak is shifted from that of a reference peak, usually tetramethylsilane (TMS). In last year's annual report we discussed an extension of Meta-DENDRAL which allowed the program to form rules in the domain of CMR spectroscopy.
During the past year we continued work on this program, and wrote a second program which applies CMR rules to structure elucidation problems. Rules generated from a combined set of paraffins and acyclic amines have been used to successfully identify the 13C NMR spectra of molecules not in the training set data. The introduction of a limited set of stereochemical terms to the rule generation procedure demonstrated the feasibility of extending the method to more complicated systems. A description of the rule formation and structure elucidation programs is given in [17]. Results are presented there for the combined set of paraffins and acyclic amines, as well as for a combined set of trans decalins and monohydroxylated androstanes.

5.1 Rule Formation Results

A set of rules was generated using a combined training set consisting of a subset of the paraffin data from Lindeman and Adams (12) and acyclic amine data from Eggert and Djerassi (13). Molecules with the empirical formulas C9H20 and C6H15N were excluded from the training set for later use in testing the generality of the rules. The rule set was tested by generating all structural isomers with the empirical formulas C9H20 (35 isomers) and C6H15N (39 isomers), predicting the spectrum of each isomer, then ranking the predicted spectra by similarity to a known spectrum. The rank of the predicted spectrum associated with the correct candidate structure provides an indication of the utility and validity of the generated rules.

12 Lindeman, L.P. and J.Q. Adams, Anal. Chem. (1971), 43, p. 1245.
13 Eggert, H. and C. Djerassi, J. Amer. Chem. Soc. (1973), 95, p. 3710.

For the above test we used the 24 C9H20 spectra available from the work of Lindeman and Adams. The predicted spectra of the 35 structural isomers were compared and ranked against each of these available spectra. The results of this ranking for C9H20 are shown in Table VII.
Table VII also gives the results of a similar test on C6H15N.

Empirical   Number of    Number of   Rank of Correct Structure
Formula     Candidates   Spectra     (Freq. of Correct Ranking)
                                     1st      2nd      ...6th
C9H20       35           24          20/24    3/24     1/24
C6H15N      39           11          8/11     2/11     1/11

Table VII. Results of Structure Ranking

5.2 Adding Stereochemistry to the Rule Language

The work on the paraffins and acyclic amines requires only topological descriptors in the language of atom features. Because of the dependence of 13C shifts on stereochemical features, it is necessary to have the facility to include stereochemical terms when they are required. Substituents placed on systems which have static conformations, such as trans decalin and androstane with trans ring fusions, can be described in discrete terms. The terms we selected describe the orientation on the ring of the substituent as either axial or equatorial, and either alpha or beta. For instance, a substituent is beta in 10-methyl-trans-decalin if it is on the same side of the ring as the methyl group, and alpha if it is on the opposite side of the ring from the methyl group.

The rule generation program, with the language extended to include these atom features, was run on a combined set of trans decalins, 10-methyl-trans-decalols and monohydroxylated androstanes with trans ring fusions, selected from the works of Grover and Stothers (16) and Eggert et al. (17).

14 Grover, S.H. and J.B. Stothers, Can. J. Chem. (1974), 52, p. 870.
15 Eggert, H., C. VanAntwerp, N. Bhacca, and C. Djerassi, J. Org. Chem. (1976), 41, p. 71.
16 Grover, op. cit.
17 Eggert, op. cit.

Sixty rules were generated to cover the 249 data peaks of 17 compounds. Samples of the rules generated are shown in Figure 11. Examination of these rules shows that they are useful for the chemist who wants to study contributions to the total shift as well as for structure elucidation.

See p. 94a for rules.

Figure 11.
Sample rules constructed from decalins and hydroxy steroids with trans ring fusions. The '*' identifies the carbon atom to which the shift is assigned. The shift is in ppm downfield from TMS. The rules on p. 94a are grouped as Alpha Carbon Rules, Beta Carbon Rules and Gamma Carbon Rules (structure diagrams are not recoverable).

5.3 Structure Elucidation

Molecular structure elucidation using CMR consists of using a set of rules which summarize the CMR behavior of a set of compounds to identify other unknown compounds within that or similar classes. The information which the chemist must supply to the structure elucidation program includes the empirical formula of the unknown as well as its observed spectrum. Two parameters may be set by the chemist: one selects the number of plausible structures to be determined, and the other specifies the error range in ppm which should be assigned to the rules to account for deficiencies in the training data, experimental error, solvent effects, etc. From this information and its store of CMR rules, the program assembles a set of structures which are plausible sources of the unknown spectrum.

Our program accomplishes molecular structure elucidation by selecting a shift (peak) in the observed spectrum, then finding the rules which are possible explanations for this shift. The rules selected postulate partial substructures which