[GRAPHIC] \WP14CVR.GIF MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY (June 1986) Maria E.Gonzalez (Chair) Office of Management and Budget Barbara A. Bailar William E. Kibler Bureau of the Census National Agricultural Statistics Service Yvonne M. Bishop David Pierce Energy Information Federal Reserve Board Administration Edwin J. Coleman Thomas Plewes Bureau of Economic Analysis Bureau of Labor Statistics John E. Cremeans Jane Ross Office of Business Analysis Social Security Administration Zahava D. Doering Wesley L. Schaible Defense Manpower Data Center Bureau of Labor Statistics Daniel E. Garnick Fritz Scheuren Bureau of Economic Analysis Internal Revenue Service Terry Ireland Monroe G. Sirken National Security Agency National Center for Health Statistics Charles D. Jones Thomas G. Staples Bureau of the Census Social Security Administration Daniel Kasprzyk Robert D. Tortora Bureau of the Census National Agricultural Statistics Service PREFACE The Federal Committee on Statistical Methodology was organized by OMB in 1975 to investigate methodological issues in Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in their personal capacity rather than as agency representatives. The committee conducts its work through subcommittees that are organized to study particular issues and that are open to any Federal employee who wishes to participate in the studies. Working papers are prepared by the subcommittee members and reflect only their individual and collective views. The Subcommittee on Statistical Uses of Microcomputers in Federal Agencies organized a one-day workshop held on April 24, 1985. This working paper is based on the workshop and discusses four topics: planning to buy and use microcomputers for statistical purposes; electronic data dissemination; applications of microcomputers; and expert systems. The report is intended to provide helpful guidance to Federal agencies in purchasing and using microcomputers for statistical purposes. The Subcommittee on Statistical Uses of Microcomputers in Federal Agencies was chaired by Terry Ireland of the National Security Agency, Department of Defense. MEMBERS OF THE SUBCOMMITTEE ON USES OF MICROCOMPUTERS IN FEDERAL AGENCIES Terry Ireland*, Chair National Security Agency Ken Berkman Michael Leszcz Bureau of Economic Analysis Internal Revenue Service Jay Casselberry Tom Nagle Energy Information Administration Internal Revenue Service Frederick J. Cavanaugh Ronald Steele Bureau of the Census National Agricultural Statistics Service Lawrence H. Cox Peter Stevens Bureau of the Census Bureau of Labor Statistics Richard Engels Linda Bouchard Taylor Bureau of the Census Internal Revenue Service Maria E. Gonzalez* (ex officio) Mark Winer Office of Management and Budget Office of Management and Budget *Member, Federal Committee on Statistical Methodology ACKNOWLEDGEMENTS The idea of a workshop as a focal point for proceedings on Statistical Uses of Microcomputers was suggested by Maria Gonzalez, Chairperson of the Federal Committee on Statistical Methodology. She also provided contacts in many Federal agencies, which made possible a broad Federal participation in the workshop. The planning of the workshop was done by the Subcommittee. Four topics were selected for the sessions of the workshop. The chairpersons designated by the Subcommittee organized each session. 
They were: Chairperson Session on Planning Lawrence Cox, Bureau of the Census Session on Electronic Data Ken Berkman, Dissemination Bureau of Economic,Analysis Session on Applications Ronald Steele, National Agricultural Statistics Service Session on Expert Systems Terry Ireland, National Security Agency The proceedings were prepared by the chairpersons and rapporteurs of each session based on input from the speakers. The Subcommittee thanks all the speakers in the workshop for their participation. Terry Ireland, who chaired the Subcommittee, and Norman Glick edited the final report. Linda Taylor ably handled all the organizational and administrative details of the workshop the real basis for a very smooth-running conference. -iii- FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY WORKSHOP ON STATISTICAL USES OF MICROCOMPUTERS IN FEDERAL AGENCIES April 24, 1985 TABLE OF CONTENTS Page Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i Members of the Subcommittee on Statistical Uses of Microcomputers . . . . . . . . . . . . . . . . . . . . . . . . . . .ii Acknowledgments. . . . . . . . . . . . . . . . . . . . . . . . . . iii Introduction. MARIA E. GONZALEZ, Office of Management and Budget. . . . . . . . . . . . . . . . . . . . . . . . 1 Session on Planning. . . . . . . . . . . . . . . . . . . . . . . . . 3 Summary. Prepared by FREDERICK J. CAVANAUGH, Bureau of the Census. . . . . . . . . . . . . . . . . . . . . . 3 Introduction. LAWRENCE H. COX, Bureau of the Census. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 The Census Bureau Microcomputer Information Center. RONALD SWANK, Bureau of the Census . . . . . . . . . . 6 The National Security Agency Personal Computing Information Center. KATHY SCHNAUBELT, National Security Agency . . . . . . . . . . . . . . . . . . . . . . . .11 II Use of Microcomputer Technology at the Bureau of Labor Statistics. PETER STEVENS, Bureau of Labor Statistics. . . . . . . . . . . . . . . . . . . . . . . .13 Discussion. LAWRENCE H. COX, Bureau of the Census. . . . . . . . . . . . . . . . . . . . . . . . . . . . .23 Questions,and Answers . . . . . . . . . . . . . . . . . . . . .25 Session on Electronic Data Dissemination . . . . . . . . . . . . . .29 Summary. Prepared by JAY CASSELBERRY, Energy Information Agency. . . . . . . . . . . . . . . . . . . . . . .29 Use of Microcomputer Disks to Disseminate Information. STUART WEISMAN, National Technical Information Service . . . . . . . . . . . . . . . . .29 Cendata: Development and Implementation. BARBARA ALDRICH, Bureau of the Census . . . . . . . . . . . . .34 Electronic Dissemination of Perishable Information. ROXANNE-WILLIAMS, U.S. Department of Agriculture . . . . . . . . . . . . . . . . . . .38 Questions and Answers . . . . . . . . . . . . . . . . . . . . .40 Session on Applications. . . . . . . . . . . . . . . . . . . . . . .45 Summary. Prepared by THOMAS NAGLE, Internal Revenue Service . . . . . . . . . . . . . . . . . . . . . . . .45 -iv- Spreadsheets and Statistical/Econometric Applications in Econometric Research. LINDA P. ATKINSON, U.S. Department of Agriculture . . . . . . . . . .46 Spreadsheets and Data Base Applications Used by the Crop Reporting Board in Reviewing Survey Indications and Preparing Publications. GARY NELSON, U.S. Department of Agriculture. . . . . . . . . . . . .50 Manager's Perspective on the Acquisition and Use of Microcomputer-Based Graphics Packages. RICHARD W. HAYS, Internal Revenue Service . . . . . . . . . . 
.51 Current Applications of UNIX-Based Microcomputer Systems. BRIAN CARNEY, U.S. Department of Agriculture . . . . . . . . . . . . . . . . . . . . . . . . . .54 Equipped for the Future? PAUL DOBBINS, U.S. Department of the Treasury. . . . . . . . . . . . . . . . . . .56 Concerns About Data Integrity, Security, and Accessibility in an Environment Where Microcomputers and Mainframes Are Interfaced. DICK SHIVELY, U.S. Department of Agriculture. . . . . . . . . .58 Questions and Answers. . . . . . . . . . . . . . . . . . . . . . . .61 Session on Expert Systems. . . . . . . . . . . . . . . . . . . . . .67 Summary. Prepared by NORMAN GLICK, National Security Agency . . . . . . . . . . . . . . . . . . . . . . . .67 Introduction. TERRY IRELAND, National Security Agency. . . . . . . . . . . . . . . . . . . . . . . . . . . . .69 Expert System Tutorial. GEORGE LAWTON, Army Research Institute. . . . . . . . . . . . . . . . . . . . . . .70 An Extension of Statistical Software to Expert Systems. JAMES J. FILLIBEN, National Bureau of Standards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78 Editing and Imputation. BRIAN GREENBERG, Bureau of the Census . . . . . . . . . . . . . . . . . . . . . . . . .85 Discussion. MARK WINER, Office of Management and Budget. . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Questions and Answers . . . . . . . . . . . . . . . . . . . . .94 Appendix. Announcement of Workshop on Statistical Uses of Microcomputers in Federal Agencies. . . . . . . . . . .97 -v- INTRODUCTION Maria E. Gonzalez, Office of Management and Budget A subcommittee of the Federal Committee on Statistical Methodology organized a one-day workshop on statistical uses of microcomputers in federal agencies. The purpose of the workshop was to share information among federal agencies on the statistical uses of microcomputers. About 200 persons from federal agencies attended the workshop. The audience had an opportunity to ask questions and make comments in the discussion period of each session. All were acquainted with the uses of microcomputers. Some were also responsible for the planning of statistical uses of microcomputers in their agencies. The announcement of the workshop is included in the Appendix. Four topics were discussed at this workshop. 1. Planning of Statistical Uses of Microcomputers. The first session described three microcomputer information centers in federal agencies. The purpose of personal computer (PC) information centers is to familiarize the agency users with the PC potentialities. This session focused on planning#, implementation, and evaluation within federal agencies of statistical uses of microcomputers. The main questions asked were: Who should have microcomputers? For what purposes should microcomputers be used? In what configurations? At what costs? How will microcomputers coexist with central automatic data processing services? 2. Electronic Data Dissemination. This session dealt with different data dissemination methods. The discussion covered each agency's approach to data dissemination and the problems encountered in implementation. 3. Applications of Microcomputers. This panel discussion focused on the usefulness and weaknesses of microcomputer software and operating systems, the interface of mainframes and microcomputers, and factors affecting data integrity, security, and accessibility. 4. Expert Systems The methodological basis for expert systems was discussed and several examples were given. 
The examples describe current expert systems with statistical applications.

The proceedings of this one-day workshop follow. For each session there is a summary, the presentations, and the discussions that followed.

-1-

SESSION ON PLANNING

SESSION SUMMARY*

The microcomputer technology of the 1980s is a personal and, therefore, a user-oriented technology. However, planning for microcomputer technology is often very complex and causes many changes in the workplace. Program planners must take many factors into account when planning the introduction of a microcomputer system into their organization. Three personal computer information centers were described:

The Census Microcomputer Information Center of the Bureau of the Census
The Personal Computer Information Center of the National Security Agency
The microcomputer system of the Bureau of Labor Statistics

The planning, management and evaluation of microcomputer technology at the Census Bureau officially began in 1983 with a meeting of the Executive Staff. Prior to that time, microcomputer technology testing and evaluation work was ongoing at the Census Bureau, but this was the first time that agency-wide distribution of microcomputers was discussed. The Census Microcomputer Information Center (CMIC) was established as a result of this meeting. To give greater emphasis to the importance of microcomputer technology, the Census Bureau located the Center in the Office of the Director, with its manager reporting directly to the Associate Director for Administration.

The purposes of CMIC are to assist employees in learning about microcomputer technology -- both from a user point of view and a manager/procurer point of view -- and to reduce the overall costs of microcomputer technology purchase and maintenance. Employees are given access to various brands of hardware and software to test prior to purchasing. They are also given "hands-on" experience in the use of the newest in microcomputer hardware and software through special arrangements made with the various vendors and manufacturers. On-site training in the use of hardware and software is provided by outside trainers, with the divisions paying the costs for their employees. Costs currently range from about $100 to $125 per person per day, which are quite favorable in comparison with commercial costs of similar training.

----------------------
*Frederick J. Cavanaugh, Bureau of the Census.

-3-

The activities of the National Security Agency's Personal Computer Information Center (PCIC) started approximately 18 months ago, when NSA established the PCIC to train employees in the use of PCs and vendor-developed software. It did not take long to discover problems of compatibility among various brands of microcomputers. Therefore, standards were established to ensure that:

1. All microcomputer systems at NSA are compatible with one another for effective communications and portability.
2. All systems are able to function using the UNIX operating system -- again, to allow for communications and portability.
3. The microcomputer systems are supportable; that is, they must be easily and cheaply repaired.
4. The systems are secure, so as not to divulge secret information.

NSA has set its microcomputer standards around the IBM PC and PC/XT in a UNIX-based environment (IBM's PC/IX) and its office automation standards around the Wang PC.

BLS's microcomputer system is essential for efficient office operation, and BLS has kept this in mind in designing and developing its system.
The BLS Executive Staff is very supportive of the microcomputer system. In designing the microcomputer system at BLS, several critical needs have to be met. These include:

1. The need for a system that can readily provide terminal communication with mainframe computers.
2. The need for a system capable of communicating among various machines and those located in field offices as well.
3. The need to provide security for confidential information.

BLS undertook research and experiments to determine which microcomputer system best met its needs. Upon completion of the research, a single system comprised of machines from a single manufacturer was implemented and a set of standards was developed around its operation and use. The present system includes over 100 IBM PC/XTs and three Ethernet (FIPS 107) local area networks.

The microcomputer systems described in the presentations form a continuum from the experimental or user-oriented approach to the more standard production or program-oriented approach. However, despite a commonality of needs and objectives, each agency has chosen a different approach to planning and managing microcomputer technology.

-4-

INTRODUCTION
Lawrence R. Cox, Bureau of the Census

Welcome to the Workshop on Statistical Uses of Microcomputers in Federal Agencies, sponsored by the Federal Committee on Statistical Methodology. We begin with this session on planning.

Microcomputer technology is the technology of the 1980's. It is a personal and, therefore, a user-oriented technology. However, its focus on the individual often can be misleading from a planning perspective -- at the agency or office level, planning and managing the use of microcomputer technology becomes very complex very fast. While encompassing important technical issues concerned with hardware, software and communications networks, this technology also quickly brings the planner face-to-face with the business of managing and deriving improvements systematically from technological change.

Inevitably, the introduction of microcomputers into an organization changes the workplace and the skills and orientation of workers. It presents new choices and often demands that these be made swiftly. In large organizations and small offices, the following questions must be addressed:

- where does microcomputer technology fit into the agency or office?
- how should it be introduced?
- how can the organization experiment and grow with this technology?
- what must the organization do to plan and manage this technology effectively?
- should standards be set for its use? which standards? how should they be set? by whom? how should they be enforced?
- what sort of future decisions need to be made, and who should make them?

We are fortunate today to have a panel of experts in this field, whose experience should shed light on answers to these and other important questions facing the statistical program manager about to embark on the introduction of microcomputers into his or her organization. They speak with the experience of individuals tasked with managing groups assigned these responsibilities in three different Federal agencies: the Bureau of the Census, the National Security Agency, and the Bureau of Labor Statistics.

-5-

The speakers are:

- Mr. Ronald Swank, Manager, Census Microcomputer Information Center, Bureau of the Census.
- Ms. Kathy Schnaubelt, Chief, Information Resources and New Technology Branch, National Security Agency; and
- Mr.
Peter Stevens, Chief, Division of Communications and Computing Technology, Bureau of Labor Statistics. Until recently Kathy was Chief of NSA's Personal Computer Information Center and had direct responsibility for the functions we will discuss this morning. Ron and Peter have had these responsibilities on a continuing basis for some time. Each speaker will make a brief presentation on how the problem of planning the use of microcomputers was addressed in their agency. I will follow with a few comments by way of formal discussion, and we will then open the floor for discussion and questions from the audience. THE CENSUS BUREAU MICROCOMPUTER INFORMATION CENTER Ronald Swank, Bureau of the Census The words "microcomputer" and "personal computer" are often used in a manner that blurs their intended use. In the true sense of the word "microcomputer," the Bureau of the Census has been using microcomputers since 1968. The FOSDIC (Film Optical Sensing Device for Input to Computers) allowed us to film and input census forms to computers without manual data entry. In 1973 we attached IBM 6250 tape drives to Sperry mainframes, approximately three years.before Sperry announced similar availability, and experimented with the ATL automated tape library. In 1982, eight Apple II+ personal computers were used to do the Puerto Rican Economic and Agriculture Census data checking and editing. These projects established the feasibility of using microcomputers in much of the Bureaus work. The Bureaus organization (3500 employees at headquarters, 9000 nationwide) can best be described as 35 separate companies (divisions, in our parlance) sharing the same resources, computers, management services, etc. You can imagine the problem this presents in setting priorities, standards and general directives. All of the Census Bureaus funding does not come directly from Congressional appropriations. So there has been a great deal of discussion on:the best way of introducing microcomputer technology to the Census user community, funding it and not intimidating or alienating Bureau users. In 1983, a joint decision was made to establish the Census Microcomputer Information Center (CMIC). The Center with a staff of 4 was placed organizationally in the Director's Office for two reasons: (1) to show Executive staff support for technology and.encourage users to make -6- active efforts to become familiar with its capabilities and (2) to avoid turf battles. The CMIC is a clearinghouse of information for use by Census Bureau employees. The goals and objectives of the CMIC are to: - assist Bureau employees in their analysis of microcomputers; -provide access to and demonstrations of a variety of hardware, software and peripherals; -provide hands-on experience with microcomputers without capital investment by the individual divisions; -provide training on microcomputer hardware and software; -provide a clearinghouse for documentation, catalogs, and pointers to knowledge for microcomputers, end-user computing and office automation; - decrease the cost of hardware and software through more informed procurement decisions. With the direction of program managers, Census Bureau employees may visit the Center for information about microcomputers, for discussions of the characteristics of particular computers and the applicability of microcomputers to projects, or for hands-on experience on a variety of machines in an attempt to implement those projects. 
One can use the computers in the Center for weeks if necessary, experimenting with various software on different machines. The role of the Center is to help Census Bureau staff define their processing needs, advise them of applicable software and guide them towards suitable computer equipment. The CMIC contains the more popular microcomputers and the more popular software. Yet, there are significant numbers of microcomputers that may provide a unique perspective in the industry and may offer the best overall systems for a particular problem. Therefore, the CMIC also sponsors product demonstrations about those microcomputers that are not currently on display in the Center. CENTER OPERATION/USE The Center's hours of operation are 9:00 a.m. to 4:00 p.m. Census Bureau personnel may schedule time to use a particular machine, software package, tutorial or specialized peripheral device for one- hour segments. They may also request one of the Center support personnel to work with them. We generally have personnel in the Center from 7:00 a.m. to 5:30 p.m. Time before and after hours of operation is devoted to Center personnel, allowing us to gather and exchange information on the day's occurrences and to provide specialized support to executives. -7- Some of the typical questions arising on a given day may be as simple as: What's the difference between a hard disk and a floppy? How can I get specific information on a specific product and its capabilities? What kind of tutorials/training are available for Lotus, dbase, etc.? Why doesn't a package perform in a specific manner? WHAT's AVAILABLE IN THE CENTER The Center subscribes to approximately 40 periodicals dealing primarily with microcomputers and associated technology. About 20% of these magazines are provided free. Also the Center has a library of 300+ books dealing with microcomputers, hardware, software, peripheral devices, etc. These books are directed at all levels of personnel. The magazines and books are available for checkout by Census Bureau employees. The Center subscribes to Data Pro for microcomputer hardware and software. There are numerous other vendor- or industry-provided catalogues available for review in the Center: IBM Personal Computer and XT Software Guide The Blue Book for IBM Engineering and Scientific Progress The Book of Apple Software The Ratings Newsletter IBM Software Directory Many of the supply and peripheral device catalogues are provided by vendors Public domain software is available in the Center. Most of it was acquired from Capital-PC for IBM's and compatibles and the Freeloader 500 software for the Apple machine. This software is not copyrighted and is available for the cost of reproduction. We have found many useful utilities available that have saved our users much development time. Microcomputer software in the following categories is available: Communications software Mathematical Database management systems Specialized Electronic spreadsheets Statistical Integrated software Word processing Presentation graphics Utilities Programming languages This software is available for user evaluation. The end user determines whether the product will produce the required results. About 15% of our software was provided by vendors for use in the Center -- but only in the Center -- for evaluation purposes and not for production work. The Bureau's policy on copyrighted software is that it is not to be copied for any reason other than backup. -8- HARDWARE There are 2 IBM PC/XT's hooked to a local area network. 
A Sperry Model 50, a Wang PC a Grid, Apple Macintosh, peripheral devices, plotters,- printers, Polaroid palette, etc. are available. Many microcomputer vendors (43 to date) have come to the Census Bureau to demonstrate their products, and many have loaned their products for evaluation from 30 to 60 days, depending on product. Some of the vendors are: A & F Computers Sony Digital Equipment Fujitsu Olivetti Motorola Hewlett-Packard Exxon Data General Radio Shack ELECTRONIC BULLETIN BOARD In February 1985, an electronic bulletin board was placed into service to facilitate information interchange on product evaluations, user projects, etc. PROCUREMENT POLICY The Census Bureaus procurement policy evolved because of our organizational structure and our funding. While all procurement actions are to be processed and controlled through the Bureaus Procurement office, requests for ADP-related actions will continue to require some specialized processing. The justification and acquisition approval for microcomputer equipment and off-the-shelf software and supplies totaling less than $10,000 is delegated to the Associate Director level. The ADP staff no longer is required to review and approve such purchases. When the purchase order is sent to the Procurement Office, requests for sole source and brandname purchases costing more than $500 must include a brief justification. When the purchase order is received in the Procurement Office, information copies are forwarded to the Census Microcomputer Information Center to be used to update the Census Bureau inventory of microcomputer equipment and software. MICROCOMPUTER MAINTENANCE POLICY The Census Bureau's microcomputer maintenance policy is based on cost. For every six machines purchased we purchase a spare machine because the cost of a one-year maintenance contract on the first six equals the cost of the. spare. These machines are not just stored; they are used in noncritical environments where they can be removed to replace a critical machine as needed within one hour. When a user encounters an equipment problem that is beyond the users capability to resolve, he or she contacts, our Technical Services Division (TSD) service representative who will respond by sending a technician to the user's site to isolate the cause of the equipment problem. -9- If it is something simple that the technician can repair on the spot (such as replacing a fuse, resealing a loose board, or tightening a plug), the technician will make the repair. If the problem cannot be resolved by the technician on site, the technician will telephone the CMIC to request that a replacement computer or input/output device be loaned to the user until the user's machine is repaired. TSD will set up the replacement equipment for the user (if necessary) and take away the machine that needs repair. The user should be able to resume normal operations with minimal delay, aggravation and frustration. If the device is still covered by its original warranty, TSD will arrange to have it repaired under the terms of the guarantee. If the warranty is no longer valid, TSD will arrange to take the machine to a designated dealer for a repair estimate. When the machine is left with the dealer, a hand receipt will be signed by the dealer and returned to TSD. When the dealer calls the estimate to TSD, TSD will prepare a purchase request and forward it to the user's division. 
The division will insert the appropriate accounting code, approve the action, place a priority flag on it, and send it to the Procurement Office. The Procurement Office will expedite all micro maintenance requests by calling the dealer with a purchase order number. When the repairs have been finished and the machine is ready for pickup, a driver will take the purchase order to the dealer and pick up the machine. This procedure is valid for any repairs totaling less than $1,000. In cases involving repair estimates in excess of $1,000, TSD will contact the microcomputer user to discuss whether the repairs should be authorized and, if so, what procedure must be followed. The loaner machine will be under the control of the CMIC with the following priorities governing their use: Top priority -- to any user where TSD has removed a machine for authorized repairs. Second priority -- for use in support of hands-on training classes sponsored by CMIC. Third priority -- for use by someone who wants to do small projects on a borrowed machine. Priority will mean exactly that A broken machine will be replaced with a loaner from the CMIC even if it means having to take the loaner away from someone who is using it under a lower priority. I want to emphasize that this is our current policy, but it can be changed very quickly. We are constantly monitoring this procedure and continually reassessing our options (i.e., outside service contract). MICROCOMPUTER TRAINING SUPPORT We established a classroom with 16 machines for hands-on-training. We did this because of the numbers of people requiring training and the cost of sending people to outside courses. The types of courses taught are: Introduction to Microcomputers, Databases, Word processing, Spreadsheets, Graphics, etc. -10- Originally there were requests for training of 3000 persons in all aspects of microcomputers. That has been reduced to approximately 2100. We believe this training demand will be high initially and then will drop off dramatically. Outside instructors have been hired to teach our classes. We have had a great deal of success with this process because of the quality of instructors acquired. To pay for this training facility we charge back directly to the user division the cost of the instructor, software purchased and maintenance cost of the classroom. This cost goes to a maximum of $125 per class, significantly cheaper than to send all people to outside training. NOTE: Many vendors sell at a small cost educational licensing agreements providing copies of their software for each machine in the classroom. some vendors will not do this; then we must purchase copies for each machine at full price. OFFICE AUTOMATION I have specifically not addressed the topic of office automation, as we are still planning and discussing exactly what office automation is going to mean at the Census Bureau. Our primary planning focus at this time is to determine what functions need to be provided Bureau- wide and what functions will be left to individual operating units. THE NATIONAL SECURITY AGENCY PERSONAL COMPUTING INFORMATION CENTER Kathy Schnaubelt, National Security Agency The National Security Agency established a Personal Computing Information Center (or PCIC for short) approximately a year and a half ago. This action was taken in response to the Agency's growing demand for personal computer products. In the year prior to the opening of the PCIC, many new personal- computer products and vendors were reaching the marketplace. 
A growing number of these products were in turn being purchased by a cross-section of Agency elements. This mix of products across the Agency began surfacing problems such as that of system incompatibility. This may be illustrated by the example of a diskette of data or software running on one computer brand but not on a different brand of computer.

The PCIC was designed to assist Agency personnel in the selection, acquisition and use of an established set of "standard" personal computer products. The basis for the selection of standard products was determined by the Agency's needs as a whole. One such requirement was for the UNIX operating system. Hardware selected as the Agency standard workstation would have to be able to run under the UNIX operating system. At the root of decisions of this nature was the concept of compatible hardware and software products that would be easy for people to acquire.

-11-

Another important concern for us was security. By going to standardization, that problem may be minimized by the selection of products that meet this requirement and then training personnel to use them. A third consideration was supportability. Maintaining a variety of microcomputers, or personal computers, can be a logistics nightmare; stocking of parts, replacing them, etc., in any number can be devastating. Finally, there is cost. By limiting the number of kinds of personal computers and software products that we use, we are able to buy large numbers of each at a lower per-unit cost. Right now we have thousands of microcomputers in the Agency, and we have plans to buy many more, which should result in a significant savings from bulk buys.

The PCIC was established to meet the following objectives: 1) to promote the use of standard equipment; 2) to share and centralize our small systems resources (like everyone here, we have a limited number of people to support these products); 3) to minimize the end-user application load; 4) to maximize cost effectiveness; and 5) to centralize product registration (providing anonymity in our workplace).

The PCIC has become a focal point for all Agency standard products, and to date these products include: an Agency standard terminal/workstation, which is an enhanced IBM XT; the standard office automation equipment, which is the WANG Professional Computer; and an interim standard local area network. So there will be a family of Agency standard host computers. The PCIC provides its customers with information on all of the standard products that are available, and this includes a reference collection of books, periodicals, in-house-developed working aids, research guides, comparison charts of the capabilities of the different products, and a referral service for technical questions. It also provides demonstrations of standard products. Anyone can go down to the PCIC and use one of the standard products, whether it's hardware or software.

To encourage the use of the PCIC by Agency personnel, the PCIC tries to make the acquisition of standard commercial products as simple as possible. Rather than have each office go out and do their own purchase request, an authorized individual can come into the PCIC and request commercial software. The software is actually stocked in the PCIC. We have licensed some items (like CONDOR and MICROPRO products, for example). By doing that, we have actually reduced some costs by 70%. Non-standard products may still be purchased, but on a limited basis. A non-standard product must be requested in writing.
This request is reviewed by a software evaluation team to determine the validity of the purchase request. When a product offers a unique capability, it is purchased and evaluated. A favorable evaluation results in the product's being added to the list of standard products. A product which does not offer any capabilities beyond the standard product line, or in fact is defective, would be placed on a prohibited-purchase list. In any case, the PCIC still does the actual purchasing, whether it's for a standard product or an evaluation copy of a non-standard product. This saves the requester from the paperwork of writing a purchase request document.

-12-

While the purpose of the PCIC is to furnish standard products, it also functions in identifying products that meet certain minimum requirements for Agency use. These products are added to the list of standard products to provide a flexible work environment for Agency personnel. The goal is not to restrict what people do or how they do it, but to make sure that the products they use are compatible with other products used throughout the Agency.

USE OF MICROCOMPUTER TECHNOLOGY AT THE BUREAU OF LABOR STATISTICS
Peter Stevens, Bureau of Labor Statistics

I made the discovery when putting this talk together that I could take the various displays and shuffle them and present them in almost any order I chose. I'm not quite sure what the conclusion from that would be, but with this heady sense of freedom, I decided to start in the middle. Therefore, the first display you see gives a brief introduction as to where we are now.

The Bureau of Labor Statistics has approximately 100 microcomputers, almost all of them standard IBM PC/XT's (see Display 1). We also have three Ethernet Local Area Networks, two in D.C. and the other in the San Francisco regional office. We have network licenses and centralized software libraries for all of these machines. This is one point -- and the first of the points I will be emphasizing -- where some of the things we are doing are, perhaps, different from what is commonly done. Floppy disks have no essential role in the entire operation. If I had my way, I wouldn't have them.

Bureau of Labor Statistics Networks and Microcomputers
Where we are now
Approximately 100 microcomputers in use (mostly highly modified IBM PC/XT's)
Three Ethernet (FIPS 107) Local Area Networks, two in DC, the other in San Francisco.
Software libraries are centralized.
We are close to completing the "large scale pilot" stage of our development effort.
For each application area our goal is to identify and validate quality products which can be made a part of the standard BLS microcomputing environment.
Display 1.

In general, the way people get software onto their machines is through local area networks from centralized storage devices. We are getting to the end of what might be called the "research phase" of this entire new technology operation. The three networks were all acquired by a competitive procurement which we ran a couple of years ago and which is, in effect, a large-scale test.

-13-

That gets to the last point on Display 1, which is the basic goal for what we are trying to accomplish right now: to identify and validate quality products which can be made a part of the standard BLS microcomputing environment; then, in the next stage of our operations, to make standards for use throughout the Bureau.
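The centralized software libraries and LAN-based distribution described above lend themselves to simple automated checks. The following is a minimal illustrative sketch only, not anything BLS actually ran: it assumes a hypothetical JSON manifest of approved package versions kept on a network file server (CENTRAL_MANIFEST) and a similar local record of what a workstation has installed (LOCAL_MANIFEST), and it reports what would need to be installed or upgraded to match the standard configuration.

```python
# Illustrative sketch only -- not BLS's actual tooling. It assumes a
# hypothetical manifest file on a network share listing the approved
# version of each standard package, and a similar file recording what
# is installed on the local workstation.
import json
from pathlib import Path

# Hypothetical locations; in practice these would point at the LAN
# file server and the workstation's own configuration area.
CENTRAL_MANIFEST = Path("N:/standard/manifest.json")   # e.g. {"wordproc": "2.1", ...}
LOCAL_MANIFEST = Path("C:/config/installed.json")

def load(path: Path) -> dict:
    """Read a {package: version} mapping from a JSON file."""
    with path.open() as f:
        return json.load(f)

def compare(central: dict, local: dict) -> list[str]:
    """Report packages that are missing locally or out of date."""
    actions = []
    for package, approved in sorted(central.items()):
        installed = local.get(package)
        if installed is None:
            actions.append(f"INSTALL {package} {approved}")
        elif installed != approved:
            actions.append(f"UPGRADE {package} {installed} -> {approved}")
    return actions

if __name__ == "__main__":
    report = compare(load(CENTRAL_MANIFEST), load(LOCAL_MANIFEST))
    print("\n".join(report) if report else "Workstation matches the standard configuration.")
```

Under a scheme of this kind the support group changes only the central library and its manifest; each workstation converges on the approved versions the next time it checks, which is the point the speaker makes about distributing software over the network rather than by hand-carried floppy disks.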
When I looked at Display 2, I decided I could put it up and talk about it for twenty minutes without any trouble at all, because it enumerates the applications and I think that gives some scope of the project. But given the terrible time constraints that we are under, I will spare you a lot of discussion here.

The following are major application areas:
Word Processing
Graphics
Spreadsheets
Statistical Analysis
Data Base Management
Survey Data Collection
Survey Control
Project Management
Calendar Management
Network Services, including Electronic Mail, Shared Data Management and Inter-network Routing.
National Communications via Public Value-Added Networks (X.25 & FIPS 100 standards).
Mainframe Communications Gateways for Interactive and Batch Operations.
Access to the Local Networks from remote (usually portable) microcomputers.
Display 2.

However, there are two things worth pointing out. Some may know from the previous references that "FIPS" stands for Federal Information Processing Standards, which are produced by NBS and which we are trying to follow. We have more standards than FIPS 100, and those things are, in general, a significant part of our operation. One other point, before moving on here, that I think is worth some mention: applications like word processing, graphics, and spreadsheets are standard and well known; but the applications that I call here Survey Control, Project Management, and Calendar Management get into a function for the microcomputer which I don't think has gotten the emphasis it deserves. This is a Control and Management function. In the same sense that a microcomputer is a useful tool to use with a project management package, it is also used and useful for keeping track of one's personal calendar and the ordinary flow of activities through the division. As we respond to the technology, this is definitely a growing area. Anyway, enough for the present.

-14-

The reason for Display 3 is not so much a chance to give you the details of how the Bureau operates, but to make a point that our efforts in these areas were started in response to a serious and well-understood operational problem that we are having. The large, centralized mainframe computer provides, in our view, a very poor, very weak environment for the general area of interactive applications.

How This All Got Started
Throughout the 70's the Bureau's approach to computing relied upon two large IBM mainframe computer centers accessed via dial-up telephone lines. While this environment served large-scale, batch-oriented survey processing well, other applications were served poorly:
Interactive applications were very hard to develop, and response from the mainframe computers varied widely.
Data communications were a constant source of problems, especially those with our Regional Offices.
The proliferation of incompatible word processing equipment caused continuing operational problems and prevented any more ambitious office automation efforts.
The most promising technical approach to solving these problems was:
Powerful microcomputers for interactive processing.
Local Area Networks for the heaviest communications and for configuration management.
Internetwork and Mainframe Gateways for extended communications.
Public Data Networks for national communications.
Display 3.

Again, I'm sure you wouldn't like to see me stand here and cry, so I'll spare you the details of the problems we have had with data communications since the AT&T divestiture.
The final point under the problem areas is again worth some emphasis. We have, I think, some thirteen odd different brands of word processors in place. None of them communicate with each other. This is a story that has -15- been, again, welltold. There was, in the Bureaus top management and operations management, a perception that this had caused us a great deal of difficulty and a very strong desire not to perpetuate that same sort of incompatibility and lack of communication in the new technology. The lower part of Display 3 shows briefly what we have selected as the technological underpinnings of the steps we are taking. Again, we could have, a long discussion on say, minicomputers versus microcomputers and the local area network services, but it is beyond the scope of this panel. will only mention that these issues were very seriously considered, and the choices listed were not made lightly. I would like to draw your attention to the phrase "configuration management." Having, let's say, several hundred microcomputers all using the same software packages would not be, in our view, sufficient to guarantee compatibility. Companies are constantly issuing new versions, and these new versions are frequently incompatible with each other. So you need not only to standardize with the level of machinery, but you need to do version control and configuration management to insure that the potential of a standard environment endures. One of the major functions of the local area network is that it makes it really possible to do this. If we wish to put up a new version of a particular procedure, we can do so. We can test it and then make that transition very easily. Back when I was planning this, I had visions of myself running down the hall with 500 floppy discs trying to distribute them. It was the horror of that nightmare that led us in that direction. Display 41 "How This All Got Started" is from a configurations perspective. I urge you not to take this too literally, but, in conjunction with Display 3, it does demonstrate the basic structure of the communications and technical environment. The large, vertical black bars indicate the local area networks themselves (that is, cable connections between machines in a single area). We use two computer centers: National Institute of Health and Optimum Systems, Inc. Those dotted lines indicate communications through the public telephone system. On the networks themselves we basically have two types of devices: The workstations (that is, machines that people use) and network services for file storage, printing, Communications, etc. Now we are at the point where we can get down to the most important part of this presentation. One of the things that I would like to try and share with you, from our experience, is an idea that I call, on Display 5, "Important Operating Assumptions." An assumption here means about the same thing that "theory" means in physics or chemistry. It means an idea that we believe and accept as true and act upon, but at the same time are constantly retesting and reevaluating. -16- [GRAPHIC] \WP1417.GIF Important Operating Assumptions No single supplier can come even close to supplying top-quality products for all our requirements. The best quality and most creative software development now is being done by independent (and frequently quite small) Software Vendors. Standards, de facto and formal, play a much more important role for the microcomputer market than they do for the mini or mainframe market. 
We can increase effectiveness and reduce risk by emphasizing, open systems and standards rather than by becoming locked in to one manufacturer's product line. The most reliable source of information about new products is our own testing. The selection, testing and integration of hardware and software are professionally very demanding tasks. Statisticians and economists should not have to become Microcomputer experts to use the equipment well. Quality in the initial selection of hardware and software is only the start of an effective operation. Support, maintenance, and especially release control for software are essential to long-term effectiveness. Planned and controlled redundancy is the best and, in many cases, the only way to achieve high reliability. Display 5. The first four items are a basic description of why we are interested in open systems" or open-systems interconnection. We have substantial experience with being in the tender and enveloping grasp of a single manufacturer and in discovering that manufacturer's products don't meet new needs, or that there is no way to interface some new piece of equipment to the existing equipment. THE MOST RELIABLE SOURCE OF INFORMATION ABOUT ANY PRODUCT IS OUR OWN TESTING. This point belongs in bold print because that is probably the essence of the whole project. The computer business has always been full of what I will call "hype": statements of doubtful truths, made just to sell equipment. The microcomputer business is, if anything, worse than the mainframe side of the business. We have found that things like articles and advertisements in magazines, the flowing promises of salesmen, and similar frivolities are simply not a basis upon which we can operate. We have certain responsibilities to our users in the, Bureau so that when we say something is going to work, they can expect that it will work. We can't then turn and -18- say that the salesman said it will work. Much of our validation is this testing of the product claims. The next two items on Display 5 deal with another very important aspect of our work. Doing the kind of validation that will cut through the hype is, in our view, a demanding task and not one which need be or should be placed upon the working statistician and economist. We have a very large number of users that want to use this technology. We have a much smaller number that wish to become microcomputer experts. We are trying to create an environment in which economists, statisticians, managers, clerical personnel, and the whole BLS community can use microcomputers effectively without having to go through the struggle and pain that is associated with selection, testing, and integration of the underlying technology. The last item in Display 5, I think, is very similar to the ones already expressed by Census. The way you get the reliability is through redundancy. One of the conclusions that followed from that idea is to use a standard configuration. Even though a particular machine may be intended for word processing and the machine next to it may be intended for statistical analysis, the underlying hardware will be the same. So that, if on the day the analysis is due, that particular machine decides to go out to lunch, the other machine can be used to finish the job. We are getting down toward the end, so we can summarize this by talking about the Project Goals and Current Policies (Display 6). 
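Before turning to Display 6, the redundancy point just made can be put in rough numbers. The sketch below is purely illustrative, and the 95 percent per-machine availability figure is an assumption rather than a BLS statistic: it simply shows how quickly the chance of finishing a critical job rises when identically configured machines can stand in for one another.

```python
# Illustrative arithmetic only; the availability figure is assumed,
# not taken from the workshop. With interchangeable standard
# configurations, a job can be finished if at least one machine is up.
def chance_job_completes(per_machine_availability: float, machines: int) -> float:
    """Probability that at least one of `machines` independent,
    identically configured machines is working."""
    return 1.0 - (1.0 - per_machine_availability) ** machines

for n in (1, 2, 3):
    print(f"{n} interchangeable machine(s): {chance_job_completes(0.95, n):.4f}")
# 1 machine:  0.9500
# 2 machines: 0.9975
# 3 machines: 0.999875 (printed as 0.9999)
```

Planned redundancy of this kind works, of course, only if the machines really are interchangeable, which is the argument for a standard hardware and software configuration.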
You may remember that I mentioned there were three important problems that this research effort was attempting to address: the need to have an environment in which we could create good interactive systems; the need to deal with our data communication flows; and a need to provide effective intercommunication between machines when used for statistical survey work, office automation, or any other purpose. Those were the goals and the motivation to start the project. They remain the goals.

Every product we distribute must be thoroughly tested before full regional use. Some of the regional offices have very little background in data processing. What we put there had better work, because we don't have the travel budget to fix the mess if it doesn't.

-19-

Project Goals and Current Policies
Project Goals:
To solve the identified major problems with communications and interactive computing.
To ensure that new products are thoroughly tested before being put into production systems or into all Regional Offices.
To open up new application areas, especially in the areas of end-user computing and office automation.
To establish the basis for the continuing, orderly introduction of improved hardware and software.
Current Policies:
The selection, evaluation, procurement, and support of new products is centralized. Strong, de facto standards exist.
The development of end-user applications is decentralized.
The introduction of new products to Bureau production systems is closely managed. Pilot tests are required and high-level approval must be gained before production commitments are made.
The emphasis on compatibility, full communications, and Bureau-wide usage is quite strong.
Display 6.

Finally, we see this whole technology as having opened up the potential to get into kinds of applications that simply weren't being done at all by any type of computer, such as some of those personal and local organizational ones that I mentioned earlier. We now need to establish a basis so that we can continue to introduce, in an effective and orderly manner, the new products and new technology that continue to pour out of the industry.

From that, we have certain policies: the centralized selection, evaluation, procurement, and support of new products. There is some doubt as to whether we will be able to sustain a centralized procurement function because of some of the problems in government procurement which are beyond the scope of this presentation. In contrast to this centralization, the development of end-user applications is decentralized. That is, the way that persons use the machines for a particular personal or organizational task is a matter of their judgment and their discretion. When we are talking about introducing this technology into Bureau production statistical systems, there is much stronger management control, and developments are closely watched. We insist on Bureau testing and evaluation before committing important Bureau projects to the new technology.

-20-

I think I have said enough about the need for compatibility. Finally, on Display 7, under the heading of Where We Are Going, there is basically more of the same. I mentioned we are getting toward the end of the large-scale research phase. We are planning to add local area networks into all eight regional offices instead of just San Francisco. We have one aspect of the Bureau which may be unique in that the Commissioner of Labor Statistics has a PC in her office. She also has one at home and uses them both.
She has an intense personal interest in what I call here, "Management Communications." Through the local networks we have possibilities that we never had before. Through the research phase of this work, we have not had what I might call "traditional government procurement cost/benefit justification analysis" very much. I expect" as we move to the broader expansion of microcomputers into Bureau activities, that analyses of that nature will become important. There are many areas about procurement issues that are, at the moment, looking through a glass very darkly. Where We Are Going: As the performance of specific hardware and software products is validated, their use will be expanded to production tasks. The number of Local Area Networks will be expanded to include all Regional Offices. The communication facilities will be expanded to include Cooperating State Agencies for data collection and'survey processing. Management communications, among the Commissioner, Office Chiefs and Division Chiefs, will become increasingly important. The number of microcomputer workstations will be significantly expanded. Obsolete or ineffective equipment will be replaced by microcomputers. New hardware and software developments will be watched for possible replacements to standard products. As the new technology replaces existing equipment and applications, greater emphasis will be placed on cost/benefit justifications. Display 7. Display 8 shows where we expect to go technologically. I ask you not to take that too literally. This is not a technical model, but rather a demonstration of the way we see things getting done with each of the regions having its own network communicating to our network in Washington. -21- [GRAPHIC] \WP1422.GIF DISCUSSION Lawrence R. Cox, Bureau of the Census I will attempt to keep my comments brief so that we can have a full interchange between the speakers and the audience in proper "workshop- fashion. In proper "discussant" fashion, I will highlight what I see as the major similarities and differences among the three approaches taken, in the context of what I have learned from the presentations collectively and from my experiences at the Census Bureau. I have learned that microcomputer technology is a must for statistical programs. Automated, interlinked statistical program offices are more efficient and effective than those which are not. Users of statistical information have discovered microcomputer technology; and, so, statistical data providers have a responsibility to keep pace. Data review and analysis at its best is an interactive process between the expert data analyst and the data, supported by statistical software. Mainframe computing cannot offer these services on a large scale in a realistic manner or at a competitive price. I have learned that an organizational focus is needed to provide information and support both to management and users as this new technology becomes introduced and assimilated within the organization. We have seen that such a group can have any of several functions, depending upon organizational size,.needs, goals and objectives: -user education and handholding -repository of literature -source of hands-on experience -maintenance -training -develop and distribute product lists and recommendations -establish guidelines for microcomputer procurement, use, maintenance, training, etc. 
-recommend standards for microcomputer hardware, software and uses of microcomputer technology -establish and enforce such standards -aid in the procurement process -evaluate procurement requests -decide upon procurement requests -advise in the management of this new technology -play an active role in its management These functions, as I have presented them, lie on a continuum from the more passive, permissive or experimental approach to the more standardized, structured, or production-oriented approach. These needs and the management philosophies underlying them seem to me to be well-represented on that continuum by the three agencies represented here today. The free-market or laboratory approach adopted by the Census Bureau says, in effect, let's provide our diverse group of programs and users with the information necessary to begin to explore uses of microcomputer technology. Let's minimize the procurement obstacles to doing so, and let's work closely -23- with users in their applications and see what lessons are to be learned and what patterns emerge. In effect, as an organization, let's not force microcomputer hardware and software choices, but-let's closely manage and monitor several experiments and learn from each of them. At the National Security Agency, decisions were driven by the overriding need to standardize on hardware and software choices- sufficiently to allow diverse and distant groups to talk to each other and access the same data and programs, but stopped short of imposing inessential standards. Within a predefined architecture of standards, NSA users are free to experiment, to share information and to tailor choices to programmatic and individual needs. At the Bureau of Labor Statistics, the requirements for good and standard communications between offices and geographic areas were paramount. Experiments were conducted to fix upon the best choices, from which standards are to emerge. The environment is intended to be uniform and capable of supporting continuing, production-oriented work. Reflecting upon this continuum for a moment, I could equally describe it as being from user-oriented to program oriented, reflecting a progression defined in terms of the number of diverse programs and functions within these agencies which each agency seeks to address with automation at the microcomputer level. Interesting, all three organizations share several characteristics: they are not small, they deal routinely with massive amounts of data, their paramount concern is improved and broader access to their own data, their systems require mainframe-gateways or links, and they operate under strict data security requirements. However, for reasons which we have heard and others you may explore in open discussion, they have chosen three different approaches to tackling the problem of planning and managing microcomputer technology. QUESTIONS AND ANSWERS Ql: How was the Census Bureau able to acquire 500 microcomputers in a little over one year given GSA guidelines? Al (Mr. Swank): The Census Bureau did not go around GSA guidelines and standards, but worked within the existing regulations. Most procurements are off the GSA schedule. Q2: What variety does the Census Bureau have in their brands of microcomputers? A2 (Mr. Swank): Currently there are 25 different brands of microcomputers in operation at the Census Bureau. Q3: Does Census go through the "GSA microcomputer store" in procuring its microcomputers? -24- A3 (Mr. Swank): Yes, when possible. 
However, the GSA microcomputer store does not stock all brands, and this forces the Census Bureau to go elsewhere.
Q4: Why did Census create a separate staff for microcomputers when they already had an established automatic data processing staff?
A4 (Mr. Swank): The Executive Staff of the Census Bureau wanted to show support for microcomputer technology and to give it high visibility and, therefore, created the Census Microcomputer Information Center and placed it in the Director's Office.
Q5: Has the Information Center taken an active role in education of upper-level management in the uses of microcomputer technology?
A5 (Mr. Swank): Yes, each member of the Executive Staff has been given at least an introductory course on microcomputer usage.
Q6: The presentation left several unanswered questions that should be addressed: 1. What about the lack of a management system for electronic files? 2. How are archiving and disposition of files handled? 3. What about programming for the PC's?
A6 (Mr. Swank): Electronic filing systems will come in the near future. There are several such systems in existence now, but the costs are astronomical.
A6 (Mr. Stevens): Software for record retention currently exists, but the big problem is file retention, for which very little software is available.
Q7: Are the PC's at Census "stand-alone" or are they networked?
A7 (Mr. Swank): Some PC's are networked, others are "hardwired" to the mainframe; the majority are "stand-alones."
Q8: Two questions regarding the presentations: 1. What is meant by "software standards"? 2. Some software packages need improvements, corrections, etc. In each agency, does anyone speak to the manufacturers as a representative of the agency?
A8 (Mr. Stevens): "Software standards" means standards for particular categories of software. For example, there are at least three subcategories of word processing software, and each would have a separate software standard at BLS.
A8 (Mr. Swank): Corporate licensing would be the answer. Those manufacturers that will not discuss corporate licensing have so much business they do not need to help keep the client happy.
-25-
A8 (Ms. Schnaubelt): The focal point for NSA is with the vendor rather than the manufacturer. NSA has had problems with RUBIX from IBM. The smaller vendors are much more eager to get the business and give better contractual terms than the large firms.
Q9: Is there a very strong recommendation from the panel for a PC information center?
A9 (Dr. Cox): An independent PC information center is an absolute necessity in a large organization.
A9 (Mr. Swank): Each agency definitely needs at least a resource person if not a center.
Q10: Would a small group need a PC information center?
A10 (Dr. Cox): Not necessarily a center, but at least a reference person.
Q11: Regarding machine-oriented versus people-oriented use of microcomputers, what would the individual agencies do for the people? What are the goals?
A11 (Mr. Swank): At the Census Bureau, if the individual divisions have the budget, they will get the microcomputers they ordered within 30 days of the request.
A11 (Ms. Schnaubelt): The goal is to have a PC on each desk.
A11 (Mr. Stevens): At BLS, the only drawbacks to a microcomputer on every desk are budget and procurement.
Q12: With the advent of work-at-home, is there a use of portable PC's for this purpose?
A12 (Dr. Cox): The major problem with portable PC's for take-home use is data security -- a large problem for each of the agencies represented.
A12 (Ms. Schnaubelt): At NSA, portable microcomputers are used by executives and others, but these machines are kept "clean" (i.e., they have never had any sensitive data on them). The portables are used for training purposes only.
A12 (Mr. Swank): The Census Bureau has many "checkout" machines, but some of these are secure machines and cannot be taken out of the building.
A12 (Mr. Stevens): BLS definitely believes in the work-at-home concept and has machines for this purpose. However, precautions are taken to protect confidential data.
Q13: How are services provided to field operators?
A13 (Mr. Stevens): The regions do their own training on the uses of the BLS system.
-26-
A13 (Mr. Swank): There is a standardized configuration of microcomputer technology in each regional office, with a nationwide company contracted to carry out maintenance.
A13 (Ms. Schnaubelt): Data and software are transmitted world-wide by mail or other secured means of communication.

-27-

SESSION ON ELECTRONIC DATA DISSEMINATION

SESSION SUMMARY

The second session dealt with electronic data dissemination, focusing on disseminating information for use with microcomputers. While the first panel discussion focused on how agencies use microcomputers within their own internal environments, this session deals with the impact of microcomputers on users of federal agencies' data and the possibilities for agencies to make information available for microcomputer users (that is, dissemination of data using floppy discs or through telecommunications). There are some very interesting opportunities for federal statistical agencies to use new media to provide data to users more quickly and in a form that is more highly usable than current printed methods. The three speakers will deal with these issues. The first speaker is from the National Technical Information Service (NTIS), which is primarily an archival-type agency for disseminating federal data and information. The NTIS program to disseminate data on floppy discs, the problems encountered, and the various issues surrounding this area will be discussed. The second speaker is with the Bureau of the Census and works with their telecommunications system called CENDATA. CENDATA is used to distribute perishable Census information to users. Our final speaker is from the Department of Agriculture. She will describe the current, ongoing process to implement a contract with the Martin Marietta Corporation to establish a telecommunications system for the dissemination of large databases containing agricultural information.

USE OF MICROCOMPUTER DISKS TO DISSEMINATE INFORMATION
Stuart Weisman, National Technical Information Service

The history of the National Technical Information Service (NTIS) dates back to 1945 with the establishment of a publication board to assist in making unclassified government documents available to the private sector. The program went through various transformations, reaching its current status as an agency of the Department of Commerce in 1970.
----------
*Jay Casselberry, Energy Information Administration
-29-
The law creating NTIS states that NTIS is to search for, collect, classify, coordinate, integrate, record, catalog, and disseminate information. In the early 1970's, NTIS received its first machine-readable information product. In 1981 a new unit was established within NTIS to manage its product line of data base files and software. In the summer of 1984 NTIS began to sell data on floppy discs.
The current NTIS machine-readable-products program contains about 10 bibliographic data bases, 300 source-text non-bibliographic data bases, 800 numeric and statistical data bases, and 1300 computer software programs. With this substantial amount of information available, NTIS began a review of procedures for disseminating information products for microcomputers. The following criteria were considered when NTIS reviewed the potential for disseminating their information products on microcomputer diskettes:
-Forecasts of the number of microcomputers
-Forecasts of the primary type(s) of microcomputers being used by business and professionals
-Physical size of the computer diskette
-Microcomputer operating systems
-In-house and/or contractor production of diskettes
-Information products to be made available on diskettes
-Entire and/or subsets of information files made available
-Production of microcomputer software
-Whether to reformat the data for use with popular data base and spreadsheet formats

NTIS has decided to make information products available on 5 1/4 inch diskettes for IBM and IBM-compatible microcomputers. Diskettes are produced by a contractor, and costs are determined based on the number of diskettes required. The main problems that have been encountered are the loss or incorrect conversion of data when tapes or diskettes are produced, mishandling of diskettes during shipment, and improper use of the diskettes by customers. The way to overcome these problems is to establish procedures for checking a diskette against the original magnetic computer tape, and to instruct transportation companies and end-users on the proper handling of diskettes. In the future NTIS will consider producing information products on high-density diskettes, hard discs, and, where it is practical, optical or video discs. With the future increase in microcomputer use by business and professionals, NTIS is making a long-term commitment to having information products available for microcomputer users. With the proliferation of data -30- management and analysis being done with microcomputers, NTIS recognizes the needs of this user community. Displays 9 through 16 illustrate the work of NTIS.

HISTORY OF MACHINE-READABLE INFORMATION PRODUCTS
Late 60's   First machine-readable products arrive at NTIS
Early 70's  Production Group formed to process orders for machine-readable products
Late 70's   Concept of Product Management introduced
1981        Office of Data Base Services
1983        Video disc products from NASA
1984        Data files available on diskette
Display 9.

DATA TAPES
Over 1,000 Titles
32 Source Agencies
40 Titles Updated Annually
25 Titles Updated 2-6 Times a Year
15 Titles Updated Monthly
Remainder Updated Less Than Annually
Standing Orders Available
Display 10.

MAJOR DATA COLLECTIONS
National Center for Health Statistics (NCHS)
Federal Communications Commission (FCC)
Energy Information Administration (EIA)/U.S. Department of Energy
National Bureau of Standards (NBS)
Human Nutrition Information Service/U.S. Department of Agriculture
Defense Logistics Supply Center/U.S. Department of Defense
Federal Reserve Board (FRB)
Environmental Protection Agency (EPA)
Display 11.

-31-

DECISIONS, DECISIONS, DECISIONS!!
Size: 5 1/4" vs. 8" (3 1/2" not readily available)
Density: Double- vs. single-sided; single- vs. double-density (quad-density not readily available)
MS-DOS vs. CP/M (or MS-DOS vs. PC-DOS)
Total in-house vs. contracting-out vs. an in-house/out-of-house balance
Products pre-selected vs. demand-driven selections
Complete files only or subsets/extracts
Software
ASCII only or various DBMS/spreadsheet formats
Display 12.

DATA DISKETTES
5 1/4" Diskettes
Standard ASCII Format
For IBM-PC Microcomputer
Unique Accession Numbers Assigned
Data Tapes Converted to Diskettes
Documentation Required
Display 13.

PLAYER RESPONSIBILITIES
NTIS:
  Order input and control
  Copy tape to be used for conversion
  Ship orders (with documentation)
  Available for consultation
Contractor:
  Create master diskettes
  Archive master
  Duplicate master
  Get duplicates to NTIS
  Available for consultation
Source Agency:
  Provide master tape (with appropriate documentation)
  Available for consultation
Display 14.

-32-

The Action
Customer contacts NTIS: "Available on Diskette?"
YES:
  1. Price (based on # of diskettes)
  2. Customer orders
  3. Order to contractor
  4. Contractor duplicates master
  5. Duplicate to NTIS
  6. NTIS mails (with documentation) to customer--overnight delivery
NO:
  1. Estimate price
  2. Customer orders
  3. Copy master tape
  4. Order to contractor with tape
  5. Contractor creates master diskette and duplicates master for customer order
  6. Duplicate to NTIS (price is based on actual # of diskettes)
  7. NTIS mails (with documentation) to customer--overnight delivery
Display 15.

Problems
Original tape ---------------- Bad tape from agency
Copy tape at NTIS ---------------- NTIS error in copying tape
Contractor converts tape to diskette master and duplicates master ---------------- Contractor error in conversion or duplication process
Duplicated diskettes sent to NTIS ---------------- Problems created in handling of diskettes
NTIS ships diskettes to customer ---------------- (magnetic field, dropped, smudge, coffee, etc.)
Customer receives and processes diskettes ---------------- Customer mishandles diskettes (see above), plus diskette processing problems
Display 16.

-33-

CENDATA: DEVELOPMENT AND IMPLEMENTATION
Barbara Aldrich, Bureau of the Census

CENDATA is an information system for disseminating Bureau of the Census ("Census") information electronically. Development of CENDATA began in mid-1983 when Census decided that certain data, especially time-sensitive economic data, should be available on-line. CENDATA was developed under the guidelines that the data should be available on-line as soon as possible after release and that the system should be developed at no cost to Census. The system was proposed as non-sole source (i.e., not limited to only one contractor). In addition, no money was to be involved in the arrangement with any contractor, and Census was to have control over the information made available. During the entire process of developing the specifications and establishing memoranda of understanding with qualified vendors, Department of Commerce lawyers assisted in refining the language and procedures.
Census' list of qualifications for vendors wishing to access CENDATA and make the information available included:
-A CENDATA user should only have to pay for time used accessing CENDATA
-CENDATA should be available separate from other data bases, be clearly identified, and include the entire CENDATA package
-CENDATA must be available seven days a week
-A CENDATA vendor must be willing to accept data delivery via telecommunications
-A CENDATA vendor must be able to offer its users the services of national telecommunications networks
-The system must be an end-use-based, user-friendly system

The reasons behind the above qualifications were to:
-ensure that vendors did not add hidden fees or package CENDATA with other services
-enable users to use major telecommunications networks to minimize costs
-obtain vendors with the capabilities to handle a large-scale data base such as CENDATA
-increase dissemination of Census information products.

Of the dozen vendors who have shown interest in the CENDATA system, four met the criteria established, and memoranda of understanding have been signed with two. -34- The first vendor, Dialog Information Services, went on-line with CENDATA on August 1, 1984. (Dialog is extremely prominent in the library community.) Dialog has CENDATA available using the standard menu-based system and also makes the information available in a full-text-searchable format. In mid-October 1984, the Glimpse Corporation made CENDATA available. Glimpse, in cooperation with the Chemical Bank of New York, markets data to the financial community. With the success achieved by the first two vendors in expanding the dissemination of Census data, Census is anticipating adding new vendors who service different sectors of the public. With the inherent advantages of CENDATA over traditional publications, Census hopes to continue to expand its user network. The primary advantages of CENDATA are the timeliness of the data and the ease of using the system. One of the first goals of CENDATA was to have sensitive economic information available within minutes after any embargo on the information is lifted. Examples of the type of sensitive information available are manufacturers' shipments and orders, retail sales, housing starts, and balance of payments. Having this information available electronically assists users who are located away from Washington, where the information is initially disseminated in press releases. The data are available weeks before users would receive them in published form, and they can be downloaded into a user's standard information system for review and analysis. Census also maintains an inventory of its products on CENDATA. This allows a user to quickly determine if a particular publication has been released and, if so, the price, source, and Government Printing Office stock number. The illustrations that follow, Displays 17 through 21, show how CENDATA has been developed for ease of use. Menus are designed to provide an inexperienced user with a choice of selections, and to move from the general to the more specific. In addition, instructions are provided to help a user move through the system.

THE CENDATA INTERACTIVE SYSTEM
The Online Information Utility at the U.S. Census Bureau.
A very small portion of the Census Bureau's vast data holdings has been included in this "information utility."
Do you wish to see the CENDATA menu? If yes, enter Y or (return). If not, enter LOGOFF to end session.
?Y
Display 17.
-35-

CENDATA MAIN MENU
1  Introduction to Census Bureau Products and Services
2  What's New in CENDATA
3  U.S. Statistics at a Glance
4  Press Releases
5  Census User News
6  Product Information
7  CENDATA User Feedback
8  General Data
9  Agriculture Data
10 Business Data
11 Construction and Housing Data
12 Foreign Trade Data
13 Governments Data
14 International Data
15 Manufacturing Data
16 Population Data
Enter item number or ? for help.
?15
Display 18.

15--MANUFACTURING
1  Introduction to the Manufacturing Statistics Program
2  M3 Preliminary Report, July 1984
. .
8  Aluminum Ingot and Mill Products, June 1984 (CIR 1433-2)
Enter item number or ? for help.
?2
Display 19.

-36-

15.2--M3 PRELIMINARY REPORT, JULY 1984
1  M3 Narrative Summary
2  Value of Manufacturers' Shipments
3  Value of Manufacturers' New Orders
. .
7  Ratio of Manufacturers' Inventories and Unfilled Orders to Shipments
Enter item number or ? for help.
?3
Display 20.

15.2.3--August 30, 1984
TABLE 2, PART 1: VALUE OF MANUFACTURERS' NEW ORDERS FOR INDUSTRY GROUPS, MARKET CATEGORIES, AND SUPPLEMENTARY SERIES
--Seasonally adjusted--
Monthly (Millions of dollars)
SIC                                            Jul.      Jun.      May
Code  Industry                                 1984(p)   1984(r)   1984
      All manufacturing industries.........    192,450   190,620   193,680
      Manufacturing industries with
        unfilled orders....................    103,496   102,051   104,482
      Durable goods industries.............    100,489    99,171   102,256
--more--
Display 21.

After moving through the choices of information topics, the user is presented with the information requested. An experienced user may move through CENDATA more quickly by specifying all parameters of the search at the same time. For example, by specifying 15.2.3 initially, all menus may be bypassed, and the user moves directly to manufacturing (15), the M3 report (2), and specifically the value of manufacturers' new orders (3). This development allows CENDATA to provide the necessary information and instructions for novice users without unduly hindering more experienced users. -37- As with any developing system, Census is soliciting comments from actual and potential users to determine possible system improvements and expansion of the data base. The primary users at the current time are economists, industry analysts, and market researchers. Future plans are to expand the data base with additional Census products. Upcoming products to be added are 1984 country population estimates and statistical profiles of every country in the world. With the addition of the statistical profiles, CENDATA moves into a new area, since the information is from the International Data Base rather than from a publication, and the profiles are not readily available outside the system.

ELECTRONIC DISSEMINATION OF PERISHABLE INFORMATION
Roxanne Williams, Department of Agriculture

The Department of Agriculture has as a primary function the dissemination of information about conditions related to agriculture. The Extension Service is one way the Department uses to get information disseminated at the local level. In addition, the Department has long utilized the printed media for the dissemination of information around the nation. A few years ago, a number of agencies in the Department became dissatisfied with the print media because of the difficulty in getting information to interested parties as quickly as necessary. The agencies, acting independently, tried electronic communication of data. Use was made of a number of commercial services such as DIALCOM, AGNET, and AGRADATA. DIALCOM is equivalent to an electronic bulletin board.
AGNET is an on-line information system developed at the University of Nebraska. About two years ago, the Department started to have problems with the use of these services. Other information companies wanted the Department to provide them the data going to existing services. They did not want to have to go to competitors for the information, for a variety of reasons. One reason was that they wanted to be able to say they obtained the data directly from the USDA. Supplying each potential vendor with USDA data was just too much of a burden for the Department. In order to continue to get data to the ultimate end user and at the same time meet the needs of commercial vendors, it was decided to establish a single department-wide system of electronic data dissemination. No agency will be forced to use this system; but if an agency decides to use electronic media, it must use the Department's system. This central system will then service the commercial vendors, including DIALCOM, AGNET, and AGRADATA. The Department decided to limit the scope of the project to what we call "time-sensitive perishable data." One example of this type of data is the agriculture marketing reports. These are perishable because they contain the current prices and the current sales of all the different commodities around the country. The data are in constant demand, and they are constantly changing as new reports arrive continuously. The demand for the quick and timely dissemination of these data is very high. -38- The Department is utilizing a commercial vendor, the Martin Marietta Corporation, to provide this service. This maintains a Department policy of not allowing public access to the Department's computer. It also keeps the Department from establishing a service that can be adequately provided by the private sector. Martin Marietta acts as an agent of the Department and has agreed not to use its position in order to benefit itself in the dissemination of these data to ultimate users. Martin Marietta can only disseminate these data through the system established for the Department. Other commercial vendors (we call them Level I users) can tie into the system with auto-dial or auto-set facilities. For a price, they can even have the main system's computer call their computer as soon as data are released and transfer those data immediately. Thus all vendors will have excellent and "equal" access to USDA's perishable data. Equal access also meant to us that Martin Marietta would not charge other commercial vendors outrageous prices for access to the system. We wanted to keep the costs to Level I users reasonable. Martin Marietta was very reasonable and agreed to modest and uniform charges. Ease of access was also important to the Department. In order to maintain simplicity and keep programming costs low, we decided to use a straightforward file structure for the data, with access obtained through a menu-driven system. The resulting simplicity of the system not only makes for easy access by users, but it also allows originating offices within the Department to upload files with a minimum of effort. Further, the originating offices maintain complete control over their own data in the system. They determine when data go into the system, when they are to be released, and when they are to be deleted. Martin Marietta only maintains the hardware and software of the system. In addition to meeting the requirements of outside (Level I) users, the system has been designed to meet the Department's own internal requirements for information.
A second type of user (Level II) has been defined. Level II users are primarily offices within the Department and the Extension Service. Other Federal agencies which make heavy use of agriculture data will be included. In order to service the Level II users, we asked Martin Marietta to allow access to smaller segments of data. These users do not need to obtain bulk data by telecommunications. The system allows us to break down bulk reports into smaller segments, all of which are accessible via simple menus. The Department anticipates that the effects of the new system will be manifold. Users should have much better access to a wider range of information. Internal communication of information within the Department should improve significantly. The demand for hard copy should be significantly reduced. All of these effects should help to reduce the cost to the Department of data dissemination.

-39-

QUESTIONS AND ANSWERS
Q1: What were the particular problems with mailing floppy discs; what kind of reject rates were encountered; and, if the discs are used for data transfer, how much of a backup do you need?
A1 (Mr. Weisman): Some problems in handling of the discs during shipment may have been avoided because we chose to use an overnight delivery service instead of the Postal Service. The quality of the service has been very high, there is very little handling required, and the service has not failed yet.
Q2: Did you mention that there were some bad discs that needed to be replaced?
A2 (Mr. Weisman): Yes. It is very difficult to track down where the mishandling of discs actually occurred.
Q3: Is there a flat percentage of reliability?
A3 (Mr. Weisman): The percentage of problems is very small, but problems do occur.
Q4: Has NTIS considered direct phone transmission of data; that is, could users call directly to the NTIS computer, similar to commercial data bases?
A4 (Mr. Weisman): We did make our bibliographic data bases available, similar to what Census is now doing (as mentioned in the talk by Barbara Aldrich). That was started around 1974, or perhaps earlier. I believe there are now four vendors carrying our data base. In addition, NTIS encourages vendors to carry its statistical files and source files. To date, no vendor has elected to carry these files because it is more difficult to carry these files than a bibliographic data base. NTIS has no plans at this time to make these files available through telecommunications.
Q5: What are the plans for disseminating data from the 1990 decennial census?
A5 (Ms. Aldrich): In terms of data dissemination for 1990 decennial census data using CENDATA, there are no solid plans, but it is an issue for thought. The product information section of CENDATA could be used as a daily update or product release for 1990. I believe that there will be some electronic dissemination, but the amount and the level are not really being addressed at this time.
Q6: Please tell us more about the software available through NTIS; is it public-domain software, software that the agencies have written for their own use, or some other type of software?
A6 (Mr. Weisman): While I am the manager for data files and data bases and there is a separate product manager for software, I will try to answer your question. The criteria that NTIS uses for handling software are the same as those used for data files; that is, the software must be Government-produced. The software must also have a common usage and be useful to others.
-40-
Q7: NTIS currently sells a catalog of public domain software for $40 that includes quite a lot of information. Why doesn't NTIS publish separate catalogs of microcomputer software and mainframe software?
A7 (Mr. Weisman): At the present time NTIS has only three packages available on diskettes for microcomputers; the rest are for mainframes. NTIS does not convert software at the present time and may never do so. Currently there are not enough diskettes available for microcomputers to justify a separate catalog.
Q8: Does Census have any feedback from CENDATA users on the services and charges?
A8 (Ms. Aldrich): Yes, based on discussions with users, the charges seem reasonable. DIALOG priced CENDATA at $36 per hour, their most inexpensive commercial rate. That price does not include the telecommunications network charge which, with discount, is generally about $6 per hour. The Chemical Bank version of CENDATA is priced at $28 per hour and includes the telecommunications charge. In addition to the positive feedback we are receiving on prices, we receive feedback on what is in CENDATA, what users would like to see in CENDATA, and what they do not like.
Q9: Is it possible to download CENDATA data and create other data files based on this?
A9 (Ms. Aldrich): CENDATA is all public domain and no part is copyrighted. Therefore, it is available for users to download to their computers or add to other data bases. This caused a slight problem with DIALOG because so many of their data bases are copyrighted. To end any confusion, a notice was put in the DIALOG newsletter pointing out that CENDATA is in the public domain.
Q10 (Mr. Berkman): Would Barbara and Roxanne discuss the impact upon their particular agencies' personnel who generate the data, in transferring the data to the two systems they discussed?
A10 (Ms. Aldrich): I would like to cover the impact in two areas: the positives and the negatives. The negative for the people generating the data is that they must provide it to us in machine-readable form, either on the appropriate kind of floppy disc or via telecommunications to our microprocessor. There are some guidelines with respect to designing tables that must be followed, and these are quite difficult. The industry standard for CRT screens is 80 characters across, so any table must be defined in 75 characters, since the vendors requested five characters for control. Often tables are split vertically, with the first part becoming Table 1, Part A; then the second part is Table 1, Part B; and so forth. The positive advantage to people preparing time-sensitive information and providing the data to CENDATA is a reduction in the interruptions from outside the agency with requests for data. Prior to CENDATA, when a data embargo was lifted, staff members would spend the remainder of the day answering the telephones and reading data over the phone. With the advent of CENDATA, users have an alternative where they can quickly receive the data. They can copy the data from CENDATA to their microcomputers and eliminate the need to listen to it over the phone and record it. There are both positives and negatives for the individuals who provide CENDATA with the information. In all cases the -41- individual division which is the source of the data provides the CENDATA staff with the information.
A10 (Ms. Williams): Agriculture has designed a system whereby each agency retains control over its own data.
This is a very sensitive subject, so the system was designed so that each agency enters its own data into the system. Because of the wide variety of equipment used by our agencies to process data and create reports, the system also needed to be designed so that the agencies did not need to change their current methods of doing business. To accommodate the agencies, each agency only needs to put a header card on its report to identify the report. If a report is to be broken up into different levels of service, an additional header card is necessary. Based on the header card(s), the system knows how to handle the report that follows. One agency, the Agricultural Marketing Service, required another accommodation because it used a leased wire service with a special protocol. Current users of these data had taps on the wire which were usually linked to teletype machines. A microcomputer system was placed between their system and our system to convert the protocol and place the headers on the data. This allowed their system to operate exactly as it did prior to development of our system.
Q11: Does CENDATA provide a computer tape to its vendors or is data communicated via telecommunications? Also, how often are the vendors' files updated?
A11 (Ms. Aldrich): All CENDATA data are transmitted via telecommunications. We use an enhanced word processor with telecommunications capabilities. Information initially goes into a private file where it is integrated into our standard system. We review the system exactly as a user would see it and determine if there are any problems. Simple problems are corrected using the vendor's editor; serious problems may be corrected by deleting the file and starting over. When we give the go-ahead, the data become available on the vendors' systems. On DIALOG the files are brought up overnight, so the data become available the next day. We update daily based on data to be made available and changes in our product listings. The update is controlled by a vendor's software. We move records into and out of their systems.
Q12: Does the Bureau of the Census pay for the update costs?
A12 (Ms. Aldrich): No. Census developed the menu. We work closely with the software design people at each vendor.
Q13: Do the vendors limit the amount of information?
A13 (Ms. Aldrich): Certainly not in the case of DIALOG. They have the philosophy that however much information you can give them, they will accept it. They consider data storage to be cheap and pride themselves on being one of the largest vendors. In the case of Chemical Bank, they have not constrained us either. About once a year they request, for planning purposes, an estimate of how much storage we will need in the next two years. We have a small amount of data available on-line, with a rich potential for it to get out of hand, but thus far there are no problems.
-42-
Q14: What were the reasons Census decided not to go sole source?
A14 (Ms. Aldrich): One of the primary reasons was our objective to get the system operational as quickly as possible. By offering it to several vendors, we could avoid the procurement process. Another appeal was that by going with several vendors, CENDATA would be available to different segments of the community. With different vendors it might be possible to reach users that previously had not been Census data users. I think that in the case of DIALOG we have found a lot of librarians who were not previously users.
Q15: Has meeting the different protocol requirements of the different vendors involved much extra work?
A15 (Ms. Aldrich): No, because we have only one system and one format for the data; each vendor must agree to adapt that format to whatever they see fit to use. There is one set of codes, which are very simple and straightforward.

-43-

SESSION ON APPLICATIONS

SESSION SUMMARY

The relatively recent emergence of powerful microcomputers (micros), coupled with the availability of specialized vendor software packages for micros, has significantly enhanced the federal statistical community's ability to gather, manipulate and analyze data. Today, more than ever, it has become easier to perform data analyses previously considered to be impractical due to resource and time limitations associated with traditional manual and computer methodologies. Accompanying these enhanced analytical capabilities are improved methods for communicating the results of our data analyses. Powerful graphics software, along with improved graphics plotters and color displays, has made it possible to easily paint pictures reflecting data analyses which before were only possible through relatively expensive and involved mainframe processing. The boom in microcomputer usage in the areas of statistical and economic analyses is due in large part to the many advantages micros have over mini and mainframe computers. In particular, today's micros have storage capacities and processing speeds which often exceed mainframe capabilities commonly found just 10 years ago. Micros are generally simpler and easier to use than minis and mainframes; they are often portable; and they cost less to procure, operate and maintain. Micros are usually more reliable (less down time), and they often possess the ability to communicate with minis and mainframes, which permits micros to access and transfer large data files. Along with the "hardware" advantages, there are also "software" advantages associated with micros. In particular, there is an abundance of high-quality and user-friendly vendor software packages available, many of which permit the user to add his or her own code to modify and enhance the package's capabilities. Relative to mini and mainframe costs, these software packages are inexpensive. A few disadvantages of micros should be mentioned as well. The ability to exercise security measures and ensure control appears to be more limited. Today's micros are slow in comparison to current state-of-the-art mainframes. There exist serious compatibility problems of file structures between vendor software packages. Finally, there is often an added personal cost to the micro user in the area of additional time spent in procurement and maintenance, since these activities are usually not required of a mainframe user. The discussions which follow address many of the issues mentioned above.
------------------
*Thomas Nagle, Internal Revenue Service

-45-

SPREADSHEET AND STATISTICAL/ECONOMETRIC APPLICATIONS IN ECONOMETRIC RESEARCH
Linda P. Atkinson, U.S. Department of Agriculture

Microcomputers are in widespread use throughout the Economic Research Service (ERS). I will be discussing their application not by secretarial staff for word processing or by data processing professionals, but rather by the economic research staff themselves. Our economists first became involved with microcomputers through the use of spreadsheet software, and this is still where the bulk of the applications are.
Packages such as Supercalc and Lotus 1-2-3 are used extensively for data preparation, developing tabular reports, producing high-quality charts, graphs, and plots, performing if-then analyses, and interfacing with mainframe software. Some of the systems which have been developed with these packages are, in fact, quite sophisticated. One group, for example, has developed a program using Lotus 1-2-3 to assess the preliminary economic impact of foreign pests on producers, consumers, and society in general. A partial budget analysis is used in which different economic scenarios are simulated by allowing changes in costs of production, yield, and prices for the affected crops. The entire system is menu driven and has options for various tables and graphs which can be produced. The program set-up is being used as a template from which similar analyses can be developed, such as a program to evaluate the impact of changes in ozone concentrations on yields. Another group had been using Supercalc for data entry and preparatory calculations before running a program on the microcomputer to convert the data to the form required for input to mainframe packages such as TROLL or SAS. After running these mainframe programs, files of output were then transmitted back to the microcomputer and reformulated for spreadsheet entry so that tables and graphs of output were automatically generated. Additional changes in the form of model output results could then be made, interfacing the flexibility of the microcomputer with the calculating power of the mainframe computer. Now this group has a simplified version of their model, the world grain-oilseeds-livestock (GOL) trade model, running entirely on the micro in Supercalc. The GOL model is an annual simulation model consisting of 27 country and regional models and 20 major agricultural commodities. The individual models are linked to solve simultaneously for a vector of prices which clear world trade. The global model system has equations for 339 country-commodity combinations. Running a 20-year projection of the full linked model on an IBM PC/XT took 48 hours; however, an individual country model runs in about 15 minutes. They hope to improve speed considerably by the acquisition of an IBM PC/AT with memory upgrades. The program has been set up to ask questions of the user, such as what country is to be analyzed for what start and end dates. Users like the flexibility of the spreadsheet format; one can get in and look at a simulation, watch the numbers change and see where any problems are. Built-in equation writers allow you to change the structure of a model, or you can edit it directly. You can pre-create graphs and have them contain historical data to compare to simulated results. -46- A good reference on building such models in spreadsheets is an article from the February 1985 issue of Byte magazine entitled "Simultaneous Equations with Lotus 1-2-3." The author demonstrates how to formulate and solve a famous macroeconomic model, Klein's Model I, using standard Lotus commands. The Gauss-Seidel iterative method is used to numerically solve the system, with a one-line Lotus macro written to test for convergence. Another example of Supercalc use is to make projections of coarse grain production in foreign countries using population projections, real GNP growth rates, elasticities of consumption with respect to income, and growth rates of production.
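To make the arithmetic concrete, here is a minimal sketch of that kind of projection, written in Python as a modern stand-in for the spreadsheets actually used at ERS; the base figures, growth rates, and elasticity below are entirely hypothetical:

    # Minimal sketch of an elasticity-based projection (hypothetical numbers,
    # not ERS data).  Consumption growth is approximated as population growth
    # plus the income elasticity times per-capita real income growth.

    def project_consumption(base_consumption, years, pop_growth,
                            income_growth, income_elasticity):
        """Project total consumption forward the given number of years."""
        annual_rate = pop_growth + income_elasticity * income_growth
        return base_consumption * (1.0 + annual_rate) ** years

    def project_production(base_production, years, production_growth):
        """Project production forward at an assumed trend growth rate."""
        return base_production * (1.0 + production_growth) ** years

    if __name__ == "__main__":
        consumption = project_consumption(
            base_consumption=10.0,      # million metric tons (hypothetical)
            years=5,
            pop_growth=0.025,           # 2.5 percent per year
            income_growth=0.03,         # 3.0 percent per year, per capita
            income_elasticity=0.4)
        production = project_production(
            base_production=9.0, years=5, production_growth=0.02)
        print(f"Projected consumption: {consumption:6.2f}")
        print(f"Projected production:  {production:6.2f}")
        print(f"Implied import gap:    {consumption - production:6.2f}")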
The spreadsheet format allows the analyst to change one item, such as an elasticity, and have everything else recalculated. In this way it becomes easy to cross-check to see if the implications of certain assumptions are reasonable. A planned enhancement to this analysis technique is to begin to use the regression capabilities of a microcomputer statistical package, ABSTAT specifically. Regression of grain conversions over time can yield estimated elasticities, which can then be put back into the spreadsheet. ABSTAT was acquired as a user-friendly package to do basic descriptive statistics and simple linear regressions. We have also acquired SPSS/PC, the micro version of the popular mainframe package. Many of our economists are accustomed to using SPSS for analyzing survey data and large cross-sectional data files such as those provided by the Census Bureau. To provide databanking of larger files of which portions might be analyzed using SPSS/PC, we recently licensed SPSS/X to run on our in-house minicomputer. SPSS/PC's ability to handle "portable" system files which can be uploaded and downloaded easily aids in forming an interface between the large and small computers. We will first apply this in analyzing the results of an in-house information-needs survey; complete questionnaire results can be stored on the minicomputer, with data for particular groups of respondents or selected variables downloaded to the micro for detailed analysis without having to be redefined. We have two packages in-house that can perform more complex econometric estimation techniques: RATS (Regression Analysis of Time Series) and SORITEC. A domestic sugar model has been set up in SORITEC. Various estimations were performed, including OLS and two-stage least squares and Cochrane-Orcutt autocorrelation correction for each equation. The model was too large at 15 equations for SORITEC to do maximum-likelihood estimation of it, but the new version, when it comes, should be able to handle it. The model was simulated in SORITEC with the various sets of coefficients and also with various changes made to the model, for example perturbing an exogenous variable by 10%. SORITEC has a command to compare actual and fitted values, computing summary statistics to measure goodness-of-fit. Because the model is somewhat large, it is run in a "batch" mode, with Wordstar used to edit the SORITEC program. The model has also been put up on Lotus 1-2-3 to experiment with the parameters. Graphwriter is used to output plots of results. There is a free version of SORITEC called SORITEC Sampler which has the capabilities of the main package up through two-stage least squares. It cannot perform three-stage least squares or maximum-likelihood estimation, or -47- handle nonlinear models. It produces nice screen graphics of regression plots including residuals, which can be dumped to a line printer (but not at present to a plotting device). While not of publication quality, the plots are very useful for analytical work. For example, as part of a farm production model, an equation was estimated with prices paid by farmers for feed as a function of corn price and the price of soybean meal. The residuals showed some problems; an autocorrelation correction was tried and the regression re-estimated. The new plot showed substantial improvement in the residual analysis.
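The regress, inspect the residuals, and re-estimate loop just described can be sketched in a few lines. The following is a minimal illustration in Python, offered as a stand-in for the SORITEC runs described above; it uses synthetic data rather than the actual feed-price series, estimates by OLS, checks the Durbin-Watson statistic, and applies one Cochrane-Orcutt step:

    import numpy as np

    def ols(X, y):
        """Ordinary least squares via a least-squares solve."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return beta, resid

    def durbin_watson(resid):
        """Durbin-Watson statistic; values well below 2 suggest positive autocorrelation."""
        return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    def cochrane_orcutt_step(X, y, resid):
        """One Cochrane-Orcutt iteration: estimate rho from the residuals,
        quasi-difference the data, and re-run OLS on the transformed data."""
        rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
        X_star = X[1:] - rho * X[:-1]
        y_star = y[1:] - rho * y[:-1]
        beta_star, resid_star = ols(X_star, y_star)
        return rho, beta_star, resid_star

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n = 60
        corn = rng.normal(3.0, 0.3, n)        # hypothetical corn price series
        meal = rng.normal(180.0, 15.0, n)     # hypothetical soybean meal price series
        # AR(1) errors so the example has something for the correction to find
        e = np.zeros(n)
        for t in range(1, n):
            e[t] = 0.7 * e[t - 1] + rng.normal(0.0, 0.5)
        feed = 10.0 + 2.0 * corn + 0.05 * meal + e
        X = np.column_stack([np.ones(n), corn, meal])
        beta, resid = ols(X, feed)
        print("OLS coefficients:", beta, " DW:", round(durbin_watson(resid), 2))
        rho, beta2, resid2 = cochrane_orcutt_step(X, feed, resid)
        print("rho:", round(rho, 2), " corrected coefficients:", beta2,
              " DW:", round(durbin_watson(resid2), 2))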
Another analyst uses RATS to estimate import demand for wheat, corn and soybeans in four Asian countries. The 10-equation model has been run through OLS, instrumental variables and Taylor-series approximations, and he is trying to get around memory constraints (supposedly temporary until the new release of the package) to do seemingly unrelated regressions. The ARIMA time-series analysis capabilities of RATS were used in this project in determining how to average prices on a yearly basis, looking at the cross-covariances between prices and imports to decide on a lag structure. RATS is also being used to estimate a Canadian grains and rapeseed model. Again, a spreadsheet, in this case Lotus, is being used to update the data and provide graphical output, as well as to simulate the results. We have at ERS a number of other software packages for microcomputers to perform more specialized functions. GAUSS is a matrix programming language that allows you to write out an analysis the way you would write it mathematically. You can easily write down the estimation commands for the coefficients of a simple linear model, or the code for a complex statistical algorithm as it appears in a journal article. GAUSS does not currently come with built-in statistical routines but is planned to include them in the future. Another program, TK!Solver, solves simultaneous nonlinear systems, again allowing you to express the equations similarly to how you would mathematically. A package called MUMATH solves mathematical problems symbolically and can take derivatives, etc. Especially useful in macroeconomic theory, it lets one change coefficients or other aspects of a model symbolically rather than numerically and see the logical implications in terms of the cross-relationships that result. We even have some researchers who use small programs written in Basic to perform a specific statistical function, such as regression or the calculation of standard deviations or coefficients of variation, rather than bother learning how to use a more complete statistical package. Finally, I would like to mention one macroeconomic model to which ERS subscribes, FAIRMODEL, which is a model of the U.S. economy developed by Professor Ray Fair of Yale University and programmed for the IBM PC and XT. The model consists of 30 stochastic equations and 98 identities and is re-estimated quarterly. It can be used for forecasting, policy analysis, scenario development and as a research tool. An analyst can run experiments, change exogenous assumptions, enter adjustment factors, or exogenize an equation or block of equations, and view the results. An interface to Lotus 1-2-3 can be obtained with FAIRMODEL to use for setting up an analysis and deriving tables and graphs from the model output. -48- These have been only a few of the very many applications of microcomputers that we have in-house. The use of microcomputers has revolutionized the way our analysts conduct their research. In the area of econometric modeling, many more alternatives can be considered and assumptions tested in a much shorter period of time, taking advantage of the interactive nature of the software on these machines. Researchers who in some cases had little computer experience previously have become proficient with the easy-to-use and flexible software available on microcomputers, particularly spreadsheets, and seem to prefer this to the use of cumbersome statistical packages. However, now that better statistical software is becoming available, interest in it is growing.
The economists I spoke with seemed to want to choose their own components of an analysis system -- spreadsheet, statistical program, graphics package, word processor -- and are concerned with having good interfaces so they can quickly move data from one program to another. Some problems with memory constraints and speed have been experienced, but hardware is rapidly improving to alleviate this. There are worries about having errors creep into programs, especially with spreadsheets that may not be well documented and might be passed from one researcher to another. These issues, and data security, will have to be addressed now by analysts who perhaps had that taken care of for them in a mainframe environment, but this seems to be a fair trade for the ability to interact directly with their models and better understand what the data are saying.

Reference
Johansson, Jan-Henrik, "Simultaneous Equations with Lotus 1-2-3," Byte, February 1985, p. 399.

Acknowledgments
Many thanks to the following researchers who shared the results of their work on microcomputers: Walter Ferguson, Vernon Roningen, Michael Lopez, David Weisblat, Suchada Langley, Gary Lucier, Carlos Arnade, Larry Deaton, Clark Edwards, Paul Prentice, and Merv Yetley.

Software Vendors
AbStat -- Anderson-Bell, P.O. Box 191, Canon City, CO 81212
SORITEC -- Sorites Group, Inc., P.O. Box 340, Springfield, VA 22151
Graphwriter -- Graphic Communications, Inc., 200 Fifth Avenue, Waltham, MA 02254
SPSS/PC -- SPSS, Inc., 444 Michigan Ave., Chicago, IL 60611, (312) 329-2400
-49-
Lotus 1-2-3 -- Lotus Development Corporation, 161 First Street, Cambridge, MA 02142
SuperCalc 3 -- SORCIM/IUS Micro Software, 2195 Fortune Drive, San Jose, CA 95131, (408) 942-1727
MUMATH -- Microsoft Corporation, 10700 Northrup Way, Bellevue, WA 98604
TK!Solver -- Software Arts, Inc., 27 Mica Lane, Wellesley, MA 02181
RATS -- VAR Econometrics, 134 Prospect Ave., Minneapolis, MN 55419
FAIRMODEL -- Urban Systems Research & Engineering, 2067 Massachusetts Avenue, Cambridge, MA 02138, (617) 661-1550

SPREADSHEET AND DATA BASE APPLICATIONS USED BY THE CROP REPORTING BOARD IN REVIEWING SURVEY INDICATIONS AND PREPARING PUBLICATIONS
Gary Nelson, U.S. Department of Agriculture

Our Agency, the National Agricultural Statistics Service, is responsible for gathering crop and livestock statistics for the Department of Agriculture. We make forecasts of the crop size during the growing season, and final estimates at the end of the year. We have a network of 44 field offices serving all fifty states. These field offices regularly survey thousands of operators of farms, ranches and agriculture businesses to gather information about their operations. Statisticians in our field offices assemble the information and make recommendations on such items as acres planted or harvested, yield per acre or the amount of grain that is in storage. They then send indications and recommendations to our headquarters office in Washington, D.C., where the data are assembled and reviewed, and U.S. estimates are set and published. Our state offices are connected to a large computer network, the Martin Marietta Data System. The indications, recommendations and comments are submitted over the network to our office in Washington, D.C. We have several IBM PC/XT's in our section, which we utilize extensively for summarizing data and weighting the data to give state, regional and national totals, as well as for designing questionnaires and various other spreadsheet applications and some graphics applications.
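As a rough illustration of the summarizing and weighting step just mentioned, here is a minimal sketch in Python; the state indications, regions, and expansion weights below are hypothetical, and the actual work is done in spreadsheets on the PC/XT's:

    # Minimal sketch of weighting state-level survey indications up to regional
    # and national totals (hypothetical figures, not actual survey data).

    # Each record: (state, region, indication in thousand acres, expansion weight)
    indications = [
        ("Iowa",     "Corn Belt", 12_900.0, 1.05),
        ("Illinois", "Corn Belt", 11_400.0, 1.02),
        ("Nebraska", "Plains",     7_300.0, 1.08),
        ("Kansas",   "Plains",     1_900.0, 1.10),
    ]

    def weighted_totals(records):
        """Return (state_totals, region_totals, national_total)."""
        state_totals = {}
        region_totals = {}
        national_total = 0.0
        for state, region, indication, weight in records:
            value = indication * weight
            state_totals[state] = value
            region_totals[region] = region_totals.get(region, 0.0) + value
            national_total += value
        return state_totals, region_totals, national_total

    states, regions, national = weighted_totals(indications)
    for region, total in regions.items():
        print(f"{region:10s} {total:12,.0f}")
    print(f"{'U.S.':10s} {national:12,.0f}")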
One microcomputer application that we have developed is called the Grain Stocks Program. This program produces a report, released four times a year, that shows the amount of grain that is in storage, both on the farm and off the farm. The report was produced manually in the past, and we wanted to put it on the micros. In designing this application, we wanted a system that would: be easy to use; be menu driven; be able to download the data from the data base on Martin Marietta to the microcomputers; assemble the data; provide a means for making changes; provide us with summaries and camera copy that we could use to print the report; provide the ability to transmit the changes back to the data base at -50- Martin Marietta; and provide the capability to compute and print a balance sheet. The program uses a combination of Condor and SuperCalc3. Another application of the micros is in tabulating and charting data used in making forecasts of the size of the various crops each month throughout the growing season. These forecasts are released on a specific day each month. Since the forecast of the size of the crop can have a definite impact on prices, it is extremely important that strict security be maintained in compiling these statistics until the report is released to the general public. To ensure that the data are kept confidential, we operate under a "lockup" procedure. The members of the Board review the data, read charts, and recommend a yield for each State and the Region. The Board then jointly agrees on a yield for each State to give the U.S. totals. The biggest use of the micros in this application has been to assemble the data to a Regional level and at the same time provide printed worksheets to the Board members for setting the estimates. We usually have less than one hour to prepare the data for review by the Board. I can enter the indications on the PC, and within about five minutes print out the spreadsheet with all the indications on it. In the past it would take almost one hour with two or three people doing the calculations and checking the totals manually to complete these tasks. Furthermore, these time savings permit extra time for reviewing the data and ensuring they are correct. In conclusion, we find ourselves putting almost all of our calculations on spreadsheets, and even people who have little experience on computers are able to effectively use the micros. In most cases there has been a considerable time savings, coupled with improved data quality.

MANAGER'S PERSPECTIVE ON THE ACQUISITION AND USE OF MICROCOMPUTER-BASED GRAPHICS PACKAGES
Richard W. Hays, Internal Revenue Service

The capability to display statistical data graphically as opposed to tabularly has been greatly enhanced with the advent of graphics software packages which can be used on microcomputer equipment. This paper summarizes the experiences of one small statistically-oriented organization, the Projections and Forecasting Group, a component of the IRS Research Division, in using microcomputer-based graphics to upgrade the quality and impact of its products.

Mission of the Projections and Forecasting Group (PFG)
Until 1983/1984, the Group's projection activities were completely mainframe bound. All consequential projections were performed at a remote IRS computer facility in a dumb-terminal, time-sharing mode. There was no graphics capability in this system. Even tabular information was difficult to extract in a format which was ready for camera-copy reproduction.
The introduction of 16-bit 10 MB hard-disk micros into the Group in early 1984 radically altered work processes within six months:
-51-
-- Lotus 1-2-3 spreadsheet software was used to format smaller projection projects.
-- A downloading capability was created so that large-scale computations could be done on a mainframe, with numbers dumped into preformatted tables.
-- A variety of different tables were created which allowed more rapid scanning for errors or problems in projections.
-- Data transmission arrangements were made with key users so that data previously supplied in hardcopy only could be provided electronically, thereby facilitating further analysis by the user without data re-entry.
-- Experiments with Lotus 1-2-3 graphics suggested that much could be done to present analytical information and projection highlights in pictures rather than words or spreadsheets. Graphic representation of data would expand the managerial and executive audience.

Graphics Experimentation
Early experiments in presentations were done using Lotus graphics and a dot matrix printer. The Group found Lotus graphics satisfactory but limited in the quality of presentation, both in terms of sharpness for reproduction purposes (a printing problem) and sophistication (a software problem). The problem of quality reproduction was solved by acquiring a six-pen Hewlett-Packard plotter. Tests and discussions with other organizations showed eight-pen plotters to be too slow, too complicated and too expensive. The second problem, sophistication and flexibility of presentation, necessitated a software survey. A number of different packages were reviewed against survey criteria:
-- compatibility with Lotus 1-2-3 files,
-- menu structure,
-- equipment compatibility, and
-- memory demands.
Chart Star, software marketed by MicroPro, Inc., was chosen at the end of this survey. Chart Star has a wide range of charts and graphics to choose from and is in all ways superior to Lotus 1-2-3 graphics. For example, bar graphics can be three dimensional; it has exploding pie-charts and a number of other options, all prefaced with easy-to-use menus. With both hardware and software in place, the Group began to routinely use graphical data representations in its reports and documents. We discovered that once it was demonstrated what microcomputer graphics could do, demand for such presentations increased exponentially. Consequently, the Group added to its repertoire a software package called Statmap, which permits shaded/cross-hatched representations of data on maps at the zip code, county, state, and U.S. levels.
-52-

Some Observations on Impact
There is no question that microcomputer graphics have greatly improved the quality of, and increased the audience for, Projections and Forecasting work products. Managers and executives are more aware of key trends and have tended to ask for additional data and displays on them. There are organizational impacts, however.
Search time - Finding the hardware and software which suits the presentation requirements of the organization requires time--staff time. Hardware and software specifications need to be reviewed, demonstrations arranged and procurement initiated. Getting the right technology for your needs requires carving out enough time from everyday work to do an adequate job of review.
Doing so requires consultation and testing and adds to production time. The longer the review chain, the more frequent will be requests to alter graphic presentations or present data in some other manner.

Integration - Good graphics create their own demand. We found that top managers expect textual material with high data content to be graphically illustrated. We have not found software which does a good job of integrating text and graphics for camera-copy development. This means graphics and text are separately produced, then cut and pasted into camera copy. The result is that making changes becomes more difficult than simply making textual adjustments on a word processor.

Color - All good graphics packages can give video displays of charts and graphics in color. With a plotter, camera copy can also be produced in color, as can overhead projections. The rub develops in moving from camera copy to production. Few organizations have the color xerography necessary to make color reproductions, although this may be coming. For the interim, graphic presentations need to be developed with a black and white final product in mind.

Competition - Good analysts quickly realize that graphic data representations help sell their products. Consequently, there is competition for the use of both equipment and software. If the organization has micros with either built-in or external hard disks, the equipment side of the equation can be solved by loading software onto the hard disk. The number of software packages needed will depend on the volume output of the Group. In our case, during peak production, one package for five analysts seems to meet the need. Supervisors and/or reviewers also need to guard against "over illustration," a problem which can occur once analysts have seen the power of graphical presentations.

CURRENT APPLICATIONS OF UNIX-BASED MICROCOMPUTER SYSTEMS

Brian Carney, U.S. Department of Agriculture

The situation in the National Agricultural Statistics Service (NASS) is unusual in that several of our microcomputers are based on the UNIX operating system. Instead of having just one user on a machine, we have multiple users and multiple tasks per user.

First, a little about what the Research Division of NASS does. There are three branches in the division: Remote Sensing, Yield Research, and Sampling Frames and Survey Research. The Remote Sensing Branch uses a UNIX system with exotic graphics hardware for satellite image processing. The Yield Research Branch is using a UNIX system for editing programs that are submitted to a mainframe computer. The Sampling Frames and Survey Research Branch, the group I am in, works in four general areas: nonsampling error research, area-frame design and construction, survey design and analysis, and statistical consulting. Much of the work of the latter branch involves large datasets from our agricultural surveys and requires the use of a statistical package on a mainframe. We use SAS primarily.

Efforts to reduce nonsampling errors in telephone interviews are what led to our use of UNIX. We became research partners with the Center for Computer Assisted Survey Methods at the University of California at Berkeley. That group had been working on a system for computer-assisted telephone interviewing (CATI). Using the CATI system, we replace paper questionnaires with a terminal, and the interviewer can enter respondents' replies directly into the computer. Error checking is performed while the respondent is still on the telephone.
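A minimal sketch of the kind of immediate check involved might look something like the following. This is an illustration only, written in Lisp rather than the actual CATI software, and the question wording and the 0-5000 acreage bound are invented:

    ;; Illustrative only -- not the actual CATI software.  An interview
    ;; item that rejects out-of-range replies while the respondent is
    ;; still on the line.  The wording and the 0-5000 bound are invented.
    (defun ask-numeric (prompt low high)
      (loop
        (format t "~&~a (~a-~a): " prompt low high)
        (let ((reply (read)))
          (if (and (numberp reply) (<= low reply high))
              (return reply)
              (format t "~&Out of range -- please verify with the respondent.~%")))))

    ;; Example: (ask-numeric "Acres of corn planted" 0 5000)

An out-of-range reply can thus be questioned and corrected on the spot, rather than weeks later during office editing.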
CATI was developed and runs under the UNIX operating system. For a while back in 1982, when we were just starting our work with CATI, we tried connecting the terminals over telephone lines at 1200 baud; at that slow data rate it took a while to paint the screen, delaying the interview. Shortly thereafter, some multiuser microcomputers became available that ran UNIX, and the CATI system was ported over to them easily. The cost of the system was not too bad, something under $40,000, so we were able to procure an inhouse system for a work group of about ten people. That was our introduction to UNIX.

Once we had installed the machines, it was clear we could do quite a bit more with them besides run CATI. What we have done falls into the category of analysts' support. There is a video display terminal on each desk, with access to the mainframe systems both interactively and in batch mode. Programs for analyses are written on the UNIX system, using the native full-screen editor, and are transmitted to the mainframe for execution. This avoids the cost of being online to the mainframe. By dialing in to other systems or having them dial in to the UNIX systems, we can move information between UNIX and the word processors and PC's. The electronic mail system has been very useful to the managers; some of the material usually covered in staff meetings is now mailed to the staff using UNIX, and they can read it at their convenience. The electronic mail system extends to the research UNIX systems in Washington and the field.

Of course, UNIX has tools to facilitate programming, technical writing, and publishing, and these are widely used on our UNIX systems. All this capability is right there under UNIX; many operating systems are not so rich. The communications capability of these systems is such that they are accessible from remote dial-in terminals in the same way as mainframes and minicomputers.

I mentioned the specific use of these systems for CATI research. The Agency, because of the interest in making CATI operational, has procured twenty to twenty-two UNIX-based machines for our field offices. Besides running CATI, these will be used for direct data entry, transcribing the responses to the paper questionnaires we still generate in field interviews. Software costs have generally been lower, since one package is purchased per system. The individual price is high, but it is generally cheaper than PC software on a per-user basis.

I have mentioned office automation. All the analysts have a CRT on their desk and prepare reports and analyses through that terminal. We can edit and review before the manuscripts hit hard copy. This can be a real time saver. There are several spreadsheets available, and a number of imaginative simulations have been done on spreadsheets under UNIX by our group. Database software is available, but is used now primarily for administrative purposes.

The statistical analysis capability under UNIX is a limitation right now. Probably one of the most complete statistical languages available is S, from AT&T Bell Labs. However, it is very large and does not run on every UNIX system because of certain hardware and memory requirements. P-Stat and Minitab both run under UNIX, but would have to be converted to run under specific systems. SAS might run someday under UNIX, but probably not for a while. The SORITEC system that Linda Atkinson mentioned is available, too, and is good for econometric analyses.
A system called UNIX/STAT is available for basic statistics and psychometric analyses. Because of these limitations, we do not do much statistical analysis directly on the UNIX systems.

Now a little about UNIX itself. Among its disadvantages is its size: it requires as much as twelve megabytes of disk for the operating system and utilities, of which there are hundreds. The system can appear to be quite complicated, and it is usually necessary to have someone available to help out with solving system problems. The commands are a bit terse, most only two to four characters long, and that can be a problem for new users.

Among the advantages of UNIX are its flexibility and power. You have an operating system that operates on minicomputers, mainframes, and microcomputers and has the same essential capabilities across them all. UNIX is powerful because you can do extremely complicated functions with a very small number of keystrokes. The multitasking means that a user can have several programs operating simultaneously. When I use viewgraphs, I set each one up interactively, then have the system actually draw the viewgraph in the background; while it is drawing, I can go on to the next viewgraph. There are hundreds of utilities native to UNIX that are useful for the full range of tasks from text processing to database to programming. The UNIX hierarchical file structure is important for managing large numbers of files.

Some of the new UNIX systems feature displays with multiple windows that can run different processes at the same time. For applications requiring detailed graphics or typesetting, several new systems use bitmapped screens. What you see is what you get, even to the different fonts, special characters, and drawings. The results, printed on a laser printer, are quite good. It is not unlike the Macintosh, but with a more substantial operating system. Decision support on a microcomputer is the idea of having all the functions an analyst needs for assembling, analyzing, and presenting data, in both graphic and text form. The systems are flexible enough to manage both text and graphics in the same files. UNIX is also developing sophisticated networking to allow shared access to file systems, but with separate processors available to each user.

We have found the UNIX systems to be extremely useful because of their power and the wide variety of utilities built into the operating system. But we have been limited by the small number of statistical packages available under UNIX, and still rely on a mainframe for most of the statistical analyses.

EQUIPPED FOR THE FUTURE?

Paul Dobbins, U.S. Department of the Treasury

The Office of Tax Analysis (OTA), which is part of the Treasury Department's Office of Tax Policy, is responsible for three major functions: first, providing revenue estimates for the Administration for the budget and its quarterly reviews, as mandated by law; second, providing on-demand revenue estimates of tax proposals; and third, providing economic analysis of current and proposed tax legislation, often on very short notice. Timeliness is of the essence for the work of OTA to have any impact during the sensitive, fast-paced negotiations that are characteristic of mark-up whenever a tax bill is pending. OTA specializes in what is called micro-simulation, which in our case is simply defined as modeling the responses of tax or household units to tax-law changes on an individual basis and weighting up the sample results to get population estimates.
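In outline, the computation is nothing more than a weighted sum over the sample. The following is a minimal sketch, in Lisp and purely for illustration; the records, weights, and the one-percent surtax rule are invented and are not OTA's actual models:

    ;; Illustration only -- hypothetical records, weights, and tax rule.
    ;; Each sample unit is (weight taxable-income current-tax).
    (defvar *sample*
      '((1520.0  18000.0  2100.0)
        ( 880.0  52000.0  9400.0)
        ( 310.0 120000.0 31000.0)))

    ;; A made-up proposal: current law plus a 1% surtax on income over $50,000.
    (defun proposed-tax (income current-tax)
      (+ current-tax (* 0.01 (max 0.0 (- income 50000.0)))))

    ;; Weighted sum of per-unit changes = estimated revenue effect.
    (defun revenue-change (sample)
      (reduce #'+
              (mapcar (lambda (unit)
                        (destructuring-bind (weight income tax) unit
                          (* weight (- (proposed-tax income tax) tax))))
                      sample)))

    ;; (revenue-change *sample*) => weighted revenue change for the proposal

Each unit's response to the proposal is computed individually and then weighted up to a population revenue estimate.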
Our data are primarily tax-return information, but we are making increased use of other sources. Even though our data files are relatively small samples, they are still large data sets and have made us largely mainframe bound. (Micro data does not imply microcomputing!) But we have begun to use microcomputers to increase overall office productivity and, having seen the light, we are attempting to push our frontiers even further out to the limits of the possible.

We would ideally have a triad of computers/computer systems supporting our work. At the top or first level would be the individual microcomputer workstations, bringing the burgeoning new wave of software development into the hands of tax economists and lawyers. Currently, only our economists and computer specialists have microcomputers: the Superbrain (TM), a Z80-based CP/M machine, a fine machine in its day but certainly no longer in the mainstream due to memory and system limitations. Our economists have benefited greatly from having these microcomputers. Most of the staff were quickly converted into very proficient users of Wordstar (TM) and Supercalc (TM), and several have demonstrated considerable programming talent. The programming staff has found its burden lightened somewhat by a transfer of focus by staff, wherever possible, from the mainframe to the micros. But as described above, we are mainframe bound by the nature of our micro-simulation bread and butter. The mainframe is the third and fundamental level of our triad.

What we're hoping to implement is a second level: a mini- or supermicro-based network linking the first and third levels together into a smooth, efficiently running system. Why this is needed can be illustrated by a generalized paradigm of how OTA often does its work. A particular tax proposal will need to be reviewed and analyzed in the course of an afternoon. First, the appropriate mainframe simulator will be run and the results brought to the staff, who may then input the numbers into an analytical framework they have developed (e.g., in Supercalc (TM)). These modified results then may be inserted into a document residing on a third device, a stand-alone word processor. In the end we have results, but not without considerable carrying of paper from office to office and re-inputting of numbers at each step along the way. This is, I submit, an old story for many an office. We have duplication and wasted effort. And yet the very presence of microcomputers has made the process faster and more reliable.

A triad of micro, mini, and mainframe computers seems to fit our demands perfectly. The new Micro-Vax (TM) may very well fit into the second slot. The workstations could easily be IBM PC's and look-alikes or DEC PC's, while any mainframe fits the foundation (we're running on a UNIVAC 1100/81 series and will soon have an 1100/92). Our ideal system would make it possible to share software and software tools at the local and network levels, and allow us to easily move text and estimates across and among all of its levels. The intermediate level also goes some way toward narrowing the software gap between mainframes and micros by providing considerable computer power and much of the latest software design. We can only anticipate the hardware advances that will narrow the gap even further.

Finally, it cannot be overemphasized that the introduction of microcomputers has fundamentally altered the way OTA does business.
But perhaps the greatest lesson learned was how much better we can do than our current partial solution. In an environment that features considerable interaction among many different specialists, a network unifying all computer assets seems to be the only way to go.

CONCERNS ABOUT DATA INTEGRITY, SECURITY, AND ACCESSIBILITY IN AN ENVIRONMENT WHERE MICROCOMPUTERS AND MAINFRAMES ARE INTERFACED

Dick Shively, U.S. Department of Agriculture

A. Background (History)

The NASS is known as a data collection agency and reporter of agricultural statistics. In this line, a substantial amount of effort is directed toward list maintenance and data collection. As the agency moved into single- and multiple-frame probability samples for more and more indications, large-scale computer resources became extremely critical, both to allow evaluation of results in a manner timely enough to meet the reporting schedules and to support much more exacting requirements for list maintenance. A large proportion of the field office (SSO) effort is directed toward maintenance of the lists associated with their individual state, as well as collecting and analyzing their data. The NASS estimating procedure normally consists of each SSO preparing the indications and estimates for their individual state; the Crop Reporting Board then reviews all of these estimates to arrive at regional and national estimates. Since many diverse commodities are estimated, and sample sizes are fairly large to cover the desired geographic areas, data conversion is a major task. For this reason, all of the SSO's are mainly equipped as remote-job-entry sites with high-volume data-entry equipment. While the NASS has been a proponent of "generalized" software for some time, operating this software at multiple sites required that installation and maintenance activities be duplicated for each site.

The NASS history in mainframe computing has progressed from each of the field offices (SSO's) being responsible for obtaining its own computer resource locally to the current approach of using a single large-scale commercial time-sharing-network vendor who can provide adequate resources to satisfy all of the agency's requirements. Until the bulk of national-interest surveys were processed on the time-sharing vendor's equipment, it was sometimes difficult to determine what procedures were used to obtain indications, and even more difficult to review multiple state outputs from different reporting formats. Even shared software required modifications to be used on different brands of hardware, and there was little assurance that these modifications would always provide identical results.

The introduction of microcomputers into the NASS offers some possibility of encountering the same problems recognized when local computer capacity was utilized. However, with care in selection of commercial software to ensure standardization where necessary, these devices offer substantial opportunities for improvement in personal productivity.

B. Data Accessibility

The typical processing method for survey data in the NASS for national surveys is for the Washington, D.C., staff to provide general guidelines for the type and amount of data validation to take place, as well as the summarization techniques. The SSO staff provides detailed validation specifications appropriate to their state, taking into consideration any specialized local conditions. SSO personnel collect and validate the data and summarize to the state level.
The D.C. staff consolidates all of the state-level information into regional and national values. The majority of the post-survey analysis is also accomplished by the D.C. staff, using the data from each SSO. For this type of approach, the data values need to be available to people at widely scattered locations. Storage on the microcomputer devices does not satisfy this requirement, since only one user at one location can access the data.

The same accessibility requirement also holds for the values released by the Crop Reporting Board. The D.C. staff reviews the SSO recommendations in establishing regional and national values. Following the Crop Reporting Board review during lockup, the state, regional, and national estimates are made public on a known date and time. These estimates, normally released at 3:00 p.m., must be immediately available to each of the SSO's to prepare reports emphasizing those items of local interest. In addition, many people outside of the NASS are allowed direct access to these published values.

Microcomputer usage in the NASS appears to be best adapted to a support function, rather than providing a source of computational power. This includes primarily office-automation functions, such as word processing and spreadsheet analysis. Because of the volume of data, the stringent time constraints imposed on processing the data, and also the geographic distribution of processing, the data need to be stored in a common repository. This allows each individual SSO access to its own data, while still making the same data available to the review staff in D.C.

A recent "Viewpoint" column in INFOWORLD made the point that data security and microcomputer-enhanced productivity are incompatible. The relationship is very strong, in that the more stringent the security measures become, the less accessible the data are, with corresponding losses in productivity. This column's analogy is that security on microcomputers is similar to inventorying pencils and paper.

One current weak point with data accessibility, from both the mainframe and the micro standpoints, is a lack of communications ability. A reliable file transfer ability, consistent from machine to machine, and allowing usage of the strong points of both mainframe and micro, will enhance our data processing capability.

C. Data Integrity

Storage of data on microcomputers in the NASS environment, on more than a temporary working basis, has a tendency to lead to integrity problems. This happens whenever the same data is stored in more than one location, whether on a mainframe or microcomputer or in a file cabinet. Anytime a value has occasion to be changed, unless all occurrences are simultaneously changed, someone will likely accept the wrong value as correct. Version control is another name for data integrity, and when using individual stand-alone microcomputers this is difficult to handle. Each machine has its own copy of software, so whenever a new version becomes available every micro user must be provided with a copy. This is compounded by those machines having only floppy-disk drives, where the same software and/or data may appear on many diskettes. To upgrade to the new version, all copies must be located and modified. The problem of maintaining data integrity on microcomputers is the same one we have been battling for years: a single copy must be identified as that containing the "correct" values, and any accesses must be directed to this copy.
Local area networks (LANs) provide some help in those situations where data is needed in a small area, where all users can be contained within the LAN. In this way, each user of the LAN has access to the same values, which will alleviate the integrity problem.

D. Data Security

The most secure system is undoubtedly manual, although it may not be very productive. Microcomputers are not the place to store data that is sensitive, unless special precautions are taken, such as a locked environment and limited access. Floppy disks are extremely easy to duplicate and just as easy to carry away from an office without detection. However, this is a human security problem and not a microcomputer security problem. An authorized person can just as easily remove a printout from a mainframe computer or pages from a record book in a file cabinet. A "Jim Seymour" column in PC WEEK suggests increasing security by locking up diskettes when not in use, checking them out for usage, and encrypting the data stored on the micro. To me, this seems contrary to the usage of a PC as a productivity tool.

The main security consideration that should be given to microcomputer usage with sensitive data is the RF, or radio-frequency, emission associated with these machines. Using fairly inexpensive devices, these emissions can be recorded from many feet away from the computer and reconstructed into the data that was being processed. This can be solved with Tempest machines or RF shields.

In closing, I would like to say that I think that microcomputers are an excellent productivity tool. We need to be aware of their strengths and limitations when designing projects for them to accomplish.

QUESTIONS AND ANSWERS

The following discussion reflects questions and answers related to the "Applications of Microcomputers" Session, involving the Chair, the speakers, and certain members of the audience.

Mr. Steele: We have heard several interesting applications discussed here this afternoon, and the purpose was to give you some sense of the breadth of applications that microcomputers are being used for. We have seen applications that were very simple uses of spreadsheets, spreadsheet templates, and data bases, ranging on to econometric projections, graphics, and then the sophisticated time-sharing multiuser UNIX-based systems. We wanted to give you a sense of the kinds of things that are being done so that we can talk more about where we are going. What more can we expect out of microcomputers, and what kinds of developments would we like to see in terms of hardware and software to achieve greater productivity from the use of microcomputers? In that light, I would like to start off and ask several questions.

Q1: Linda, you seem to be a strong proponent of the IBM microcomputers. Could you explain why?

A1 (Ms. Atkinson): Being with the Federal Government, I am probably supposed to start with some sort of a disclaimer about products; therefore, my comments do not constitute a product endorsement. When we first started acquiring PC's, it was done by our Economic Research Staff, and we had a proliferation of all kinds of machines, including some that are no longer being made. It became clear that not only was this hard to support, but we were going to have problems in being able to move data from one machine to another, and possible communications problems later should we want to network them.
Also, the situation that we are seeing now is that the newer software, such as SPSS PC and SAS, is being developed first for the IBM machines and later, if at all, for other machines. So if you are on these other types of computers, you are going to lose some time during that waiting process. I would say by now that probably about eighty percent of our PC's are IBM or IBM compatibles.

Q2: Brian, I see you seem to be a very strong proponent of the UNIX-based machines, and you give very strong arguments for, well, a good discussion of their relative strengths. We have heard a lot of sales hype about UNIX being the operating system of the future. Is that really going to happen?

A2 (Mr. Carney): AT&T would certainly have us believe UNIX will be the operating system of the future. UNIX offers some special capabilities that you don't get with other operating systems. One in particular is software compatibility across UNIX machines. As an example, CATI software will run on practically anybody's UNIX box, anywhere. So we have acquired a degree of vendor independence right there.

But as far as hardware is concerned, say for example you want to network machines together; at that point you are down to a level where UNIX doesn't really do you a whole lot of good, because the individual vendors choose to implement UNIX differently on different types of hardware. There is some effort to relax those restrictions. Sun Microsystems has a hardware and software implementation of networking that is largely vendor independent. But that's not something, in general, that you can get and walk out with today.

UNIX suffers from being extremely complicated. As I mentioned earlier, you have to have an expert around or available when the machine goes down. Because of the size and complexity of the system, it does take the user quite a while to be productive. In terms of the so-called popular software (Lotus, Symphony, etc.), none of that is available under UNIX. Whether it becomes available depends a lot on what AT&T can pull off with their market. They say it will happen, but we haven't seen it yet.

Q3: Paul, as you look toward implementing your design of the future, having this triad or three levels, do you anticipate making them all one brand or one vendor and one standard software line; or what kind of problem do you envision in the interchange of data between packages?

A3 (Mr. Dobbins): First of all, I would say that we don't have enough experience at the moment to say exactly what it is that we will have a year or so down the road. I personally would look forward to having more flexibility and not essentially going toward one standardized system or one standardized software. We ultimately might want to cut bait with our mainframe. As the minis become more powerful and we have super-micros on the desk, we may begin to use the mainframe for initial processing and then download a data base to one of our PC machines. I am looking forward to experimenting for a while before I will be able to say too much more.

Q4: Rick, you mentioned that you had some problems with output devices. Could you expand on that a little? What problems have you encountered, and what could the vendors do to solve some of these?

A4 (Mr. Hays): What we are looking for is a good vehicle for integrating our text with our pictures or graphics. We haven't found that, and we are still looking. One of the things we find is that there are a tremendous number of products out on the market.
We are a small shop and we are doing our own research. So far we haven't had any luck with text/graphics integration. Once you start experimenting, you find that you can spend a lot of time looking at various packages. The other thing that you will find is that people will suggest different software you can look at and try. Each of these experiments takes time out of your production, so there is a tradeoff between finding something that will work for you and finding the best and latest product. Once you have found something which works for you, I think you should stick with it for a while and find out its complete capabilities. When we talk about output devices, we are having problems with finding print devices that will handle graphics, pie charts, and diagrams, as well as text, at a reasonable speed. I think most of the products presently available have problems with integration of text and graphics. It is not an insurmountable problem, but it slows you down.

Q5: Dick, you expressed several concerns about data security when microcomputers are interfaced with mainframes. Does that mean that you think people shouldn't have microcomputers hooked to the mainframes, or that we should lock everyone's door, or what?

A5 (Mr. Shively): All I was saying on that was that the microcomputer has a problem with security. Mainframe computers are very well adapted to securing data, securing devices, making things pretty secure. Microcomputers, by design, are a productivity tool. If we try to add security beyond what is necessary, we reduce a lot of the productivity gains that are possible there.

Mr. Steele: At this time I would like to open the floor up for general questions. Please stand and identify yourselves by your name and agency before asking the question.

Q6: This is a question that is appropriate to organizations that haven't yet capitalized on current technology and are looking to get into the business. What is the appropriate level of computer power for economic feasibility forecasts and statistical work? I wonder if anybody can comment on the relative merits of waiting until 32-bit microprocessor technology is developed.

A6 (Mr. Hays): As it turns out, we recently had a contractor come through and evaluate our operation in terms of what ought to be our computer configuration in the future. Basically, their conception is that anybody who is working with any sizable data base is not going to get along without a mainframe. Their suggestion is that you use the mainframe for your heavy-duty computations and as a storage device, to deal with some of the security problems we talked about here. They then are suggesting to us, at least, that we look now to microcomputers, 16-bit machines that can network into the mainframe and be used either in a stand-alone capacity for test purposes, to try out graphics, to upload and download, or as remote terminals. I know we are going to maintain our mainframe capabilities while doing as much as we can on micros, because micros are much simpler and more flexible for analysts to use. Where we need to spend time is on gateway architecture that links our micros to the mainframe and allows analysts who are not experts in programming to get in and out of the mainframe and get data in and out of software packages.

A6 (Mr. Carney): In the UNIX-based workstations that I am familiar with, the real 32-bit technology comes on a single chip in a powerful box. I think people are looking toward the Motorola 68020, which should be in production sometime in the fall of 1985.
It takes a little while, not too long, under UNIX for the software to catch up with it. It may be a while before it's really as fully mature as you are looking for right now. You've heard this old song before, but you really want to find what software you want first, and then figure out what sort of boxes you can afford that it will run on. You really are looking for productivity, after all.

Q7: Each of the panelists has described a decentralized system, especially one using spreadsheets and/or spreadsheets with something else. How do you insure the integrity of the data that you are using; and second, how do you insure that whatever statistical standards may exist in your agency are being followed in those decentralized systems?

A7 (Mr. Carney): I can talk about it a little bit in the research environment. Basically, we always have to use the data on the mainframe as the benchmark data. That is the correct data, and anything we pull off has to go through checking protocols, and we can't change those numbers. As far as statistical standards go for the research unit, you pretty much have to depend on the review process -- review by our peers.

A7 (Mr. Shively): I second Brian's statement. We basically consider the mainframe data to be the official source unless some special circumstances exist where there is no need for it to ever be on the mainframe. But for any data that is shared or that is nationwide in scope, the mainframe is the official source of control. If you pull a copy of that to your micro to run it through a model, you are working with data that at that point in time is not the official copy. It may be a copy of the official data, and you can use it for your model or plan -- anything like that -- but if you want to go to publication with it, you need to go back to the mainframe for the official copy.

Q8: This is a comment. We are using more and more graphics lately. There are some packages on the market which were released recently that will capture the picture of the U.S. map -- then you can enhance that map. You can add titles, text, whatever you want to these graphics. The program works in the background -- one is the graphics partner. You can call it up like the Sidekick package. It captures the picture you have on the screen, and you can enhance and modify the picture. You have the integration of text and graphics. I suggest that you try SMART; they have the software package. Also, take a look at GEM, which is just coming out from Digital Research.

Q9: I have a question for Linda Atkinson. I think you mentioned a spreadsheet that ran for forty-eight hours? Did you have an 8087 (mathematical functions) processor?

A9 (Ms. Atkinson): Yes, we did. It is a very large model, and this is the simplified version of it on the micro, and yet it took that long. It's very large and it ran very long; but yes, there was an 8087 chip in there. They are hoping that the AT is going to improve the situation. If not, they may not be able to ultimately move to the micro but will need to use the mainframe as well for that model.

Mr. Steele: I have one anecdote that I would like to share with you about computers and applications of computers. I became involved with microcomputers in 1978. By 1980 I had most of my functions in my office already automated on the computer, and I called up one of my colleagues, who was Secretary of the Crop Reporting Board, to get some information from him on when one of our employees had last been in on the Crop Reporting Board.
He said, hold on just a minute, I need to check my data base, and within thirty seconds he had an answer back for me. He read back all the times that this guy had been in and when he was next scheduled, and I was really impressed. I couldn't believe how quickly he had all of that information. I didn't have a nice data base like that, so I asked him what kind of data base he was using, and he said, "file cards." Certainly, the purpose of telling that anecdote is to illustrate that there are certain applications that are best left to a manual procedure. Oftentimes I encounter people trying to automate procedures that aren't well defined manually. I think that any time people try to automate procedures that aren't well defined manually, they are expecting magic; and they usually end up with a lot less than what they are asking for.

Ms. Atkinson: I would like to make one comment about acquiring statistical software. If any of you are looking for software or a good source of reviews of products that people have tried, there are at least two electronic bulletin boards that I am aware of. The Capital PC users group has a special interest group for statistics. Charlie Hallihan, who is one of the chairs of that group, has left some information on the desk outside which will tell you how to access their bulletin board or attend their meetings. There is also a SAS users group, even though SAS is not yet on the micro. This is a group right now who likes SAS and also uses micros; I guess they like to discuss their applications of getting data back and forth from SAS. They have a bulletin board also. A paper was presented on that at the SAS users' group meeting last month which I think had the phone number of the bulletin board. Otherwise, you can contact me. As I said, there is software available on these bulletin boards that people are willing to share. There is a SAS macro on the SAS users' group bulletin board.

SESSION ON EXPERT SYSTEMS

SESSION SUMMARY*

Both the DATAPLOT and the editing and imputation systems described here were developed not by computer scientists or knowledge engineers but by subject-matter specialists who were presented with new tools to assist them in improving their jobs. Although "expert system" tools and techniques were developed by a community of researchers who happen to call their field "Artificial Intelligence," the tools and techniques can be considered useful in their own right without the necessity of calling the result "Artificial Intelligence" systems. In fact, there is good reason to say that none of the existing expert systems is truly intelligent or even expert. A true expert has the ability to learn new rules in his specialty and to apply common sense reasoning in cases where specific rules don't happen to reside in his "knowledge base." Both "learning" and "common sense reasoning" are areas of artificial intelligence research in which there are only a handful of active workers and in which progress has been slow. Contemporary "expert systems" neither learn nor exhibit common sense behavior when it is warranted. But, as a set of tools and techniques, expert system technology has proved to be useful for some specific applications. We have seen two examples of such applications today in Mr. Filliben's DATAPLOT system and in Mr. Greenberg's edit and imputation software.
I think it is worth noting that these successful expert system examples were done by mathematicians and statisticians, rather than artificial intelligence specialists or even computer scientists. Less than two years ago, there were dire warnings that expert system techniques could not be generally applied because there were so few PhDs being granted to people who had specialized in artificial intelligence research. What we are finding, though, is that the techniques important for developing expert systems can be taught to people in other specialties. In fact, many organizations (including the Digital Equipment Corporation and the IRS), given the choice of training artificial intelligence researchers in application domains or training subject-matter specialists in the tools and techniques of artificial intelligence, have opted for the latter. The Bureau of the Census and the National Bureau of Standards show that good subject-matter specialists are perfectly capable of learning the techniques without any deliberate training program by their agencies.

One of the reasons to have a panel on expert systems at a conference on statistical uses of microcomputers is that such systems as DATAPLOT can be adapted to personal computers as soon as the personal computers are powerful enough to accommodate them. To expand on that theme, please note that today's personal computers are already much more powerful than the "supercomputers" of the early 1960's. We are guaranteed, given what can already be seen in computer engineering laboratories, that the inexpensive personal workstations of the future will be powerful enough to accommodate the kinds of systems that need mainframe computers today.

________________________________
*Norman Glick, National Security Agency

But the existence of that future power for the benefit of an individual will make even more important some of the research that hasn't been discussed today but is part of what the artificial intelligence research community is concerned with. The ability to provide a user model to accompany an expert system addresses some of the points made in the talks and the question period today. Users do have different levels of sophistication and expertise of their own. We would like the system to accommodate to the needs of the user, even to adapt to the changing expertise of a single user. The same person who might need substantial help in using a system for the first few times might ultimately consider verbose assistance to be a nuisance. The ability of a system to adapt to the evolving needs of such a user is a subject of active research in the artificial intelligence community today.

Since it was announced that this session might be on "pure fantasy," and since what we have heard from the Bureau of the Census and the National Bureau of Standards has been on eminently practical systems (whether they are called expert systems or not), perhaps we should end with some speculations that some might consider fantasies. One class of artificial intelligence research that promises to have relevance to statistical systems of the future is automatic programming. Both the editing and imputation and the DATAPLOT systems required that the statisticians and mathematicians write programs. Whether they were intentionally building expert systems, unconsciously building expert systems, or simply writing a program to assist in statistical analysis, they needed to provide significant detail about how the computer should do what they wanted.
If sufficient expertise of the programming art can be captured in an expert system and can be combined with sufficient expertise in a particular domain, even one of the domains we've heard about today, then the combined system might permit a user to state what he wants done, rather than the details of how he wants to do it, and a program to perform the job could be generated automatically. To a modest extent, so-called fourth-generation languages provide existence proofs of such systems today. These fourth-generation systems work in very limited domains (e.g., payroll and inventory control), but there is substantial research aimed at increasing the set of applications for which such approaches are practical. Some even see this class of activity as the future of software engineering. Please note that some differences exist in the life cycle of standard software relative to the life cycle of "artificial intelligence" software.

It is clear that current software engineering techniques will not provide the quantity and quality of software required in the future. More statisticians, mathematicians, and psychologists will need to tell computers what they want done in the future without computer-specialist intermediaries. Let's hope the automatic programming "fantasy" becomes less fanciful so that, in the future, more subject-matter specialists can be their own "knowledge engineers" rather than be dependent on programming specialists. Statisticians shouldn't have to spend inordinate time learning the details of how to use specific computers when their talent is to apply their mathematical and statistical knowledge.

INTRODUCTION

Terry Ireland, National Security Agency

It's possible that the organizers of this workshop, and I was one of them, wanted to have one session on pure fantasy, and this might be it. Building expert systems that model in software the behavior of human experts, and evolve in a natural way so you can more clearly understand the expert, is so clearly impossible that there must be -- and is -- an unlimited amount of high-priced advice on how to do it. Some statisticians may argue that random sampling and surveying procedures on computers can already model the experts, so we really, perhaps, have two questions: What do we mean by "an expert"? If we know what an expert is and if we have one on hand, how do we go about modeling his behavior?

In order to give some more practicality and reasonableness to this presentation, we have made absolutely certain that none of the speakers is a computer scientist. However, they are skilled developers and users of software systems, and they have built expert systems. George Lawton is a psychologist with the Army Research Institute. He has an interest in systems that support the interface between human factors and computer science. He will give the introduction to expert systems. Jim Filliben is a statistician with the National Bureau of Standards who has an interest in systems that model and support statistical expertise. In fact, one of his software systems is said to be the most requested piece of software from NTIS. Brian Greenberg is a mathematician with the Bureau of the Census. He has an interest in expert systems for data editing and imputation. Roughly a year ago I gave a talk on expert systems -- an abstract talk, because my practical knowledge was limited. After the talk, Brian came up and observed that he wasn't sure about the jargon that was being used, but he felt that he had built an expert system.
The Rapporteur is Norman Glick. He and I are both computer scientists, and he may try to have the last word. Mark Winer, an economist from the Office of Management and Budget, is our Discussant, who will keep us honest.

Computer scientists are trying to create tools for the development of expert systems and to make them commercially available. They are also trying to give the impression that they are the most skilled at eliciting information from experts. Thus, the once humble programmer now calls himself a knowledge engineer. Ultimately, a knowledge engineer is a person (sometimes a statistician, psychologist, or mathematician) who makes the most substantial effort to understand and model the expert.

EXPERT SYSTEM TUTORIAL

George Lawton, Army Research Institute

A number of things have happened recently that would lead me to change some of the things I say today, had I the opportunity to do so. Fortunately, some of those things are available to you through your local newsstand. One was the publication last month of a magazine called PC, not to be confused with PC World and PC Junior, which has a section describing a number of proprietary software packages which are available on microcomputers for developing your own expert systems. Anyone who is inclined to go out and build an expert system may look at this article and review the software. The other was my attendance last week at a conference at Bell Labs which brought together a number of computer scientists and statisticians who were all interested in what I am going to be discussing here this afternoon. I was surprised to find that there are at least sixty people from the United Kingdom, the United States, and at least a couple of European countries who are interested in this subject. No less than John Tukey of Princeton University believes that this is the next wave of software for statistical applications. It seems to me that this is something of a coming concern.

Expert systems come out of laboratories for research in artificial intelligence. Artificial intelligence, as I think everybody in the world now knows if he reads the popular press, is a line of research developed at MIT, Carnegie-Mellon University, and Stanford University in particular, concerned with building computing machines which will emulate high-level human cognitive capabilities. In the earliest incarnation of artificial intelligence, primarily at Carnegie-Mellon University, researchers tried to develop very powerful general-purpose problem-solving algorithms which would give a user appropriate support in tackling a problem that a human expert could solve. Those programs were largely failures. As a consequence of those failures, research in artificial intelligence has converged on one organizing theme in the past decade: to be intelligent, computer programs have to be able to access large bodies of knowledge. It isn't their deep problem-solving capabilities that make people intelligent; it's the fact that they know a lot about the world in which they operate, and so it must be for programs. An outgrowth of that discovery is a type of computer program which is essentially nothing but a large collection of knowledge and a relatively simple mechanism for accessing that knowledge and using it to solve various problems. These programs are called Expert Systems.
I intend to talk about what expert systems are, about how their software is written, about the software techniques that expert-system developers use, and finally about what statisticians might want to do with expert systems.

ARCHITECTURE BASED ON THREE SEPARATE MODULES
1. KNOWLEDGE BASE
2. INFERENCE MECHANISM
3. USER INTERFACE
Display 22.

Following good programming practice, expert systems are modular, and they usually have at least three fundamental modules. The first is a collection of facts and rules in a knowledge base; the second is a relatively simple program evaluator we call an inference mechanism. The third is the interface with the user, which gives the user the illusion that he is dealing with something that's intelligent (see Display 22). That's what makes the machine capable of passing what we call the Turing Test. This test is very simple. If a person acting as judge asking questions cannot tell the difference between a machine and a human any more frequently than he can tell the difference between a man and a woman, then the machine must be considered intelligent. Of course certain rules prevent the obvious shortcuts (for example, a terminal should be used to ask and to answer the questions). And, of course, the respondents (woman, man, machine) are not required to tell the truth.

In the knowledge base we have some knowledge, and it has to be represented in some form that can be used by the computer. Almost every expert system that I have had occasion to study uses one basic knowledge representation (possibly supplemented by some others): some kind of conditional structure we call production rules. The fundamental idea of a production rule is strikingly simple. It has two parts. The first asks if something -- some state of the world -- is true. The second part takes some specified action if that something is true. Programmers know them as if-then constructs in programming languages. In conventional programs they are usually scattered throughout the program text. In a knowledge base they are collected together into a list of rules.

1. Production Rules
There are two classes of production rules.
A. Situation-Action Rules (which are essentially data-driven procedure invocations)
   e.g., if the data are skewed, then call a re-expression procedure.
B. Conditional Assertions
   e.g., if the case has V1 = 10 and V2 = 20, then it is an outlier.
Display 23.

Production rules really break down into two different classes. One of them is something we call a Situation-Action Rule. It's essentially a pattern-invoked program. The other is really a conditional assertion (see Display 23). Most expert systems are based on one of those two kinds of production-rule systems.

There are at least two other ways of structuring knowledge that are widely used in expert systems.

Conceptual Networks. First, you may have a large number of concepts to represent in the knowledge base, and the concepts might be related to each other (for example, a class-instance relationship or a set-subset relationship). There are species of animals, and each animal species has members. It doesn't make any sense to represent all of the features of each of those members, so they may be organized into conceptual, often hierarchical, networks which show these relationships.

Frames and other structured objects. This is really an extension of the idea of a record, with the record containing not only the usual type of data, e.g., name or age, but also other records as data. Moreover, the data can be specified as a computation: e.g., if you know the person's year of birth and you know the year, you can compute age without storing its actual value. Frames also contain default information to be used or computed when the required data is missing.
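To make the idea of rules-as-data concrete, here is one way the two rules of Display 23 might be written down in a knowledge base. This is only an illustrative sketch in Lisp; the list format and the accessor names are invented for this tutorial, and real systems each use their own notation:

    ;; Illustrative sketch: the Display 23 rules stored as data, not as
    ;; code scattered through a program.  The format is invented here.
    (defparameter *kb-rules*
      '((skew-check                           ; situation-action rule
         (data-skewed ?variable)              ; IF part: a pattern to match
         (call re-expression ?variable))      ; THEN part: invoke a procedure
        (outlier-check                        ; conditional assertion
         (and (= v1 10) (= v2 20))            ; IF part
         (assert-fact (outlier this-case))))) ; THEN part: add a new fact

    (defun rule-name (rule) (first rule))
    (defun rule-if   (rule) (second rule))
    (defun rule-then (rule) (third rule))

Because the rules are ordinary list structure, an inference mechanism of the kind described next can examine them, fire them, and report which ones it used.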
1. Forward Chaining
   e.g., If P then Q; if Q then R; if R then T; P; therefore T.
   Explanation: By P infer Q; by Q infer R; by R infer T.
2. Backward Chaining
   e.g., T if R; R if Q; Q if P; P; therefore T.
   Explanation: To show T, first show R; to show R, first show Q; to show Q, first show P; P is true.
Display 24.

Inference mechanisms. Because we are talking primarily about knowledge that is represented as conditional structures, we must have some logical reasoning process to make use of them. For example, we can use some of the basic rules of logic: if we know that Proposition P is true, and if we know that Proposition P being true implies Proposition Q is true, then we can conclude from these two true statements (one a fact, the other a general reasoning procedure) that Proposition Q is true. Or, we can reverse the process, starting with a goal to show Proposition Q is true and reasoning backwards, looking for a sequence of statements that could bring us to the desired conclusion. Again, most expert systems use some variation of the first and second forms of reasoning shown on Display 24.

Propagation of uncertainties, statistical reasoning. A third form of reasoning, which we will come back to in a minute, attaches to each conditional structure something called a certainty factor. Statisticians might think that the certainty factor might be a probability, because some vary between zero and one, or maybe a correlation, because others vary between minus one and plus one. In general, they are not that well motivated -- they are ad hoc. They are just numbers that somebody pulled out of a hat, saying this is how certain I am that this fact is true. The question is, how do they get propagated through a sequence of inferences? This is a difficult problem about which there has been much recent discussion.

Inheritance of Properties: Again, we can represent objects in terms of a network of class relationships. By default, certain things may inherit properties from their class. If Fido is a dog, then Fido is warm-blooded, because Fido is a dog and dogs are warm-blooded.

Heuristic Rules: In certain cases, no straightforward and logical procedure may apply, in which case you may apply what we call a Risk It Rule. It's a rule of thumb that says, "if we don't know any better, do this."

Meta-reasoning: Last, but not least, an expert system may have rules about rules, a form of reasoning about process often called meta-reasoning. It deals with reasoning about the representation of the problem.

All of these methods have been incorporated into the handling of expert systems.

USER INTERFACE
1. Read user input.
2. Provide user with useful output.
3. Explanation facility, to give the user a useful trace of the program's inferences.
4. Knowledge-base input and editing.
Display 25.

This is the most important part of an expert system. The distinguishing feature of an interface for an expert system -- a well-designed interface -- is Item 3 (see Display 25). Expert systems, unlike other computer programs, explain the conclusions they reach. I would say there is no other really necessary feature for an expert system than this ability to explain. Previously, I showed several logical forms for the reasoning process.
They provide examples of what an explanation might look like. There you have a crude expert system. Such explanations are relatively unfriendly. They just say "by this rule, I infer this; and by that rule, I infer that," and so on, until you get to a conclusion. Other systems are much better at knowledge representations, including diagrammatic representations of good inference and explanations in good English.

Traditional Software Engineering -- Linear Program Development
1. System requirements
2. Software requirements
3. Preliminary design
4. Detailed design
5. Code
6. Debug
7. Test
8. Use
9. Maintain
Display 26.

How do you go about building an expert system? The methodology that most people use is a little different from the methodology you might have learned in a basic programming or a software engineering class. Display 26 shows the steps that a conventional programming text might tell you to follow when building software. By the way, expert systems are just enlarged software systems.

Alternative Approach -- Cycles of Progressive Refinement
Preliminary Requirements
Preliminary Knowledge Engineering
Prototype I
1. Design
2. Coding
3. Debugging
4. Testing
Prototype II
1. etc., etc., etc.
Display 27.

This is an alternative list that is used by most people who have developed expert systems. Rather than starting from the specifications and going step by step through the program development, requirements analysis, and the rest to a final software product, expert-systems developers follow an iterative process which begins with a small program that is written and tested, then elaborated, and written and tested again. In fact, the programming language used in this development methodology was designed to support the iterative and experimental development of software. The ability to express your ideas in a high-level, flexible language enables the programmer to develop rapid prototypes or models of the system he wishes to define.

LISP is the language of American artificial intelligence research. Notice that I said research. It's the language of artificial intelligence research. That may not mean that it's the best language for artificial intelligence implementation. It is a functional programming language. That means that programs written in LISP are functions that can be passed around just as ideas are passed around and used where appropriate. They are more like mathematical functions in the sense that they have the mathematical properties that you associate with a function, rather than the properties that you would associate with a FORTRAN function. That's a formal statement that I don't really want to defend any further than to say it really is true.

Four Basic Components at the LISP Top Level
(print (eval (read)))
The first three are in the top-level loop of the LISP interpreter.
1. Reader
2. Evaluator
3. Printer
4. A table of LISP objects which serves as a data base.
Display 28.

LISP provides a series of operations that you would have to make for yourselves if you were going to write a program in something like FORTRAN. When you invoke LISP on a computer, you are invoking an endless repetitive loop which looks like this. That is a top-level interactive computing environment from which you can either ask for a computation or define a new function, very much as you would interact with another person. Sometimes you are computing values or ascertaining facts; other times you are developing new ideas. Both activities are done in the same environment.
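Spelled out as code, the (print (eval (read))) idea of Display 28 amounts to no more than the following. This is a schematic sketch only, not a real implementation of a LISP top level:

    ;; A schematic of the LISP top level described above -- not a real
    ;; implementation, just the (print (eval (read))) loop spelled out.
    (defun toy-top-level ()
      (loop
        (format t "~&> ")
        (print (eval (read)))))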
The innermost expression is ready and waiting for you to type something into the terminal which it will read. Then there is the high-level functional evaluator which knows how to evaluate any well-formed expression in LISP. Then it will print out the results of that. It continues to go through this loop. It is like a conversation (see Display 28). -75- Invoked by the functions in this loop are three programs: the LISP reader which includes both the ability to scan the characters entered and to format them into a LISP expression, the LISP functional evaluator which can evaluate or compute the expression entered, and the LISP printer which knows how to format and print the results of the evaluation. Moreover, the LISP reader also stores information about the names or identifiers read in. Most of this information is stored in a table that holds rules, values and names. This table is useful as the data base. It is more than a simple table produced, for example, by FORTRAN. Why would you want to use LISP? Because LISP is interactive, you can write a program and see how it works almost immediately. Compilation is unnecessary. Its modularity enables you to write small segments of code in the form of functions, checking each one as it is written. LISP doesn't require declarations, although good programming practice suggests they be included in the final product. This enables you to develop functions quickly for experimental use. LISP dynamically allocates whatever kind of data structure you want to use. That means when you call in a function, LISP will make it immediately available to you. This means LISP handles all storage allocation for you, allocating it when needed, cleaning it up when you are finished with it. Because there are no differences in LISP between programs and data structures, LISP can be represented as lists just as if it were another data structure. Therefore, it is easy for LISP to reason about or deal with its own functions just as humans examine their own procedures. As, a consequence, LISP provides sophisticated tracing and debugging capabilities. PROLOG Prolog is based on a general-purpose pattern matching and inference or theorem-proving mechanism called unification resolution. It is based on a formalism symbolic logic called first order predicate calculus. Basic Components: 1. Read and Print 2. Procedure evaluation based on unification-resolution, implemented as backtracking search. 3. A data base which contains the definitions of procedures and the facts needed by the program. Display 29. PROLOG is a sort of second-class language for artificial intelligence research in the United States, but it is gaining adherents (see Display 29) It has a big following in England and in Europe. PROLOG stands for PRogramming in LOGic. The idea is that PROLOG is a language based on a subset of first-order predicate calculus called Horn clauses. It makes use of inference procedures used in proving the correctness of logical -76- statements. It allows you to write computer programs simply by making statements about what you want to be true in a clause form. PROLOG provides many programming structures you would otherwise have to build. It can read program goals or procedures and print out the results for you. It's interpretive and it gives you all of the facilities of allocation of storage and reformation of unused storage in a manner similar to the support environment in LISP. You do not need LISP or PROLOG to build an expert system--but it sure helps. 
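The shape of the top-level loop in Display 28 can be mimicked in a few lines. The sketch below is in Python rather than LISP; the small symbol table standing in for "a table of LISP objects which serves as a data base" and the use of Python's own evaluator are illustrative assumptions, not a description of a real LISP interpreter.

    # A minimal sketch of the top-level loop in Display 28, written in
    # Python rather than LISP: read, evaluate, print, over and over.

    symbols = {"pi": 3.14159}          # the data base of named objects

    def reader():
        return input("> ")             # 1. Reader: get an expression from the terminal

    def evaluator(expression):
        # 2. Evaluator: compute the expression; names like pi are looked
        #    up in the symbol table.
        return eval(expression, {"__builtins__": {}}, symbols)

    def printer(value):
        print(value)                   # 3. Printer: format and show the result

    if __name__ == "__main__":
        while True:                    # the endless repetitive loop
            try:
                printer(evaluator(reader()))
            except (EOFError, KeyboardInterrupt):
                break
            except Exception as error:  # a bad expression should not end the session
                print("error:", error)

Typing pi * 2 at the prompt prints 6.28318; LISP's top level behaves the same way, except that what you type are LISP expressions and what is remembered between interactions are LISP objects, functions included.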
The two speakers who are immediately following me are going to talk about expert systems they built in FORTRAN. What's important is to identify and to make use of the distinguishing features of expert systems: the abstractions and program structure. How do we use expert systems in statistics? This is a review of suggestions discussed at the conference on Expert Systems and Statistics that I attended last week. First of all, I don't know how many of you are statisticians; but if you are, your knowledge, as distinct from the knowledge of laypeople who come to you for consultation or as distinct from pure subject- matter knowledge (for example, economics) consists of a collection of strategies for working with a set of data. And those strategies could probably be readily represented in the form of an expert system which knows what test to do next. Parts of the data need to be cleaned up. One of the most active areas I of research that I have found is the specification of individual statistical strategies in the form of expert systems, either as interfaces to existing statistical acronyms like S or SPSS or SAS, or in the form of a complete statistical system. Nobody wants to suggest that you can improve upon the capabilities of these statistical packages. What you may want to suggest is that it may be possible to improve their usability by adding something between the user and the package. Another area of research concerns reasoning with uncertainty. Statisticians have something to say about that. I mentioned the adding of certainty factors to knowledge bases in most expert systems. An active area of research is to determine how those certainty factors should be propagated through a rule system and how certain conclusions can be based on uncertain knowledge. Existing statistical software deals only with statistical ideas at the lowest level. It provides code to do things like least-squares fitting and so on. We want to use the ideas of expert systems to move software up one more level to deal with the abstract ideas of statistical strategy--the choosing of statistical methods and selective analysis of data in light of these methods. Our success in this area depends on the development and use of modern programming languages and on the development and use of expert system models. -77- AN EXTENSION OF STATISTICAL SOFTWARE TO EXPERT SYSTEMS James J. Filliben, National Bureau of Standards The outline for this talk falls into four general areas. We are going to be talking about the real relation of an expert system to a particular piece of software, namely, DATAPLOT. I am going to speak first about DATAPLOT to show how the expert system can be described with respect to, DATAPLOT. We are going to be talking about the general structure for the intelligent subsystem -- expert subsystem -- and DATAPLOT and go into the interpretation of conclusions mode and the analysis guideline mode. The last mode deals with providing a guide for carrying out data analysis. We will go through a particular data problem. DATAPLOT is a high-level interactive statistical system with its own language, a high-level language with English-like commands. It was designed at the National Bureau of Standards (NBS) in 1977. The National Technical Information Service (NTIS), has been distributing it for the last three years. The software is written in FORTRAN. The cost is $1200. It is the most heavily distributed software of its type at NTIS. , It has been installed at about 200 sites. 
Next year it will be the most heavily distributed piece of software, period. Its primary capability is graphics. That means it can run on Tektronix, HP and various other graphics terminal devices and on a variety of mainframes. It has both analysis graphics and presentation graphics. There are extensive additional capabilities in graphical data analysis and nongraphical data analysis, modeling and fitting, mathematics and diagram graphics. At the National Bureau of Standards (NBS) we are interested in modeling and fitting data. In particular, we are often interested in fitting nonlinear models. Moreover, we make extensive use of applied mathematics and diagram graphics. By diagram graphics I mean the construction of schematic diagrams. The NBS is an engineering and scientific research organization. We have people who like to make schematics. We spend our time making schematic diagrams and charts. This component of DATAPLOT supports the automation of that work. There is a heavy emphasis on data fitting in order to test underlying assumptions. The graphical displays are important because they provide insight into the underlying structure. Insight is important if you must go into court, for example, and defend your understanding of mechanisms at work in the data you have analyzed. Three notable cases that have arisen in our area are the analysis of the draft lottery, the argument over the use of daylight-saving time a couple of years ago, and data concerning the Alaska Pipeline. Graphical analysis was a critical component in those projects.

Display 30 shows the structure of DATAPLOT. There is a data analysis activity on one side and a mathematics activity on the other side. Three activities common to both are plotting, fitting, and various transformations and function evaluations.

-78-

[GRAPHIC] \WP1479.GIF

Display 30.

Display 31 shows the typical commands you can issue to DATAPLOT. They support plotting (commands 1, 2 and 4), fitting (commands 3 and 5), and function evaluation (command 6).

TYPICAL COMMANDS
1. PLOT X Y
2. PLOT EXP(-X**2) FOR X = -3 .1 3
3. FIT Y = A+B*EXP(-ALPHA*X)
4. BOX PLOT Y X
5. ANOVA Y X1 X2 X3
6. LET A = ROOTS SIN(X**2)+EXP(-X) FOR X = 0 TO 5

Display 31.

Displays 32 and 33 give examples of the display capabilities of DATAPLOT. All the graphics shown can be generated with any sort of system. Whether you have Tektronix, PLOT 10, IGL or any graphics terminal, the important question is how long it takes to generate the graphical display. If it takes more than thirty seconds to a minute to do so, we lose the continuity so important to human-machine interaction. In data analysis the only concern is finding underlying structure and getting insight. When generating graphics gets in the way of the objective of finding underlying structure in data, we lose control of the analysis. Thus, the utility of graphics software is measured not in what it can do per se, but rather in how easy it is to do it -- how easy it is to understand, write, modify and communicate the instructions.

The DATAPLOT Intelligence Subsystem is an augmentation of the current system to provide information and guidance as if a statistical consultant were present during an analysis. Basically we want to provide an expert subsystem that asks the right questions as we step through an analysis. In order to get insight -- more than answers -- asking the right questions is just as important as coming to the right conclusions.
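For readers who want a point of reference, command 3 in Display 31 is a one-line request for a nonlinear least-squares fit. The sketch below shows the same fit written out by hand in Python with SciPy; it is not DATAPLOT code, and the data, starting values, and noise level are made up for the illustration.

    # A sketch of what DATAPLOT's one-line command "FIT Y = A+B*EXP(-ALPHA*X)"
    # is asking for: a nonlinear least-squares fit. Not DATAPLOT code; the
    # data, starting values, and use of SciPy are illustrative assumptions.
    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, a, b, alpha):
        return a + b * np.exp(-alpha * x)

    # Made-up observations roughly following y = 1 + 2*exp(-0.5*x) plus noise.
    x = np.linspace(0.0, 10.0, 50)
    rng = np.random.default_rng(0)
    y = 1.0 + 2.0 * np.exp(-0.5 * x) + rng.normal(0.0, 0.05, size=x.size)

    # Estimate A, B, ALPHA; p0 gives the starting values for the iteration.
    params, covariance = curve_fit(model, x, y, p0=(1.0, 1.0, 1.0))
    a, b, alpha = params
    print(f"A = {a:.3f}, B = {b:.3f}, ALPHA = {alpha:.3f}")

    residuals = y - model(x, *params)   # residuals for checking the assumptions
    print("residual standard deviation:", residuals.std(ddof=3).round(3))

The point of the comparison is not the fit itself but the economy of the command language: one English-like line replaces the boilerplate above, which is exactly the property argued here to matter for keeping an interactive analysis moving. The intelligent subsystem described next builds on that same command language.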
The expert system interacts with the analyst, setting the pace and posing questions along the way: have you checked this, have you checked that, what does such and such a plot look like? It will look like such and such so perhaps you should go in this direction, that direction, etc. Display 34 shows some of the human-machine interaction problems that must be addressed in an expert system. If the user requests an operation like BOXPLOT, he should be able to see a one-line definition and the rationale for its use. In other words, if the expert system recommends a certain course of action, the analyst should be able to ask questions like, "What is the penalty if we don't follow this?. -80- [GRAPHIC] \WP1481.GIF [GRAPHIC] \WP1482.GIF HUMAN PROBLEMS (DESIGN GOALS) DEFINITIONS RATIONALE LINKING TOOLS KEY TESTS HYPOTHESES/CONCLUSIONS VARYING EXPERIENCE Display 34. The expert system should support the linking together of statistical analysis tools, often in unexpected ways. Data analysis is primarily sequential and interactive. We step through the data, step through the analysis; and at each step, the next step is dictated by what we have seen before that. Scientist's often deal with correlation plots to see if there is any correlation structure in the data. DATAPLOT supports a correlation- plot command and many other graphic commands, and analysis. However, if someone asks for a correlation plot, the expert system should assist the analytic effort by carrying out appropriate statistical tests behind the scenes. Another important but time-consuming aspect of data analysis (especially when you are writing research papers for the general science community) is the need to frame your hypothesis and conclusions in proper statistical terms. An expert system should support this formulation. We have found that to be very helpful to the average scientist and engineer. Every paper that goes out of NBS goes through our statistics review process to guarantee that hypotheses and conclusions have been properly stated in statistical terms. The last aspect is the varying experience. Any expert system is going to have a problem dealing with different kinds of methods. No one expert system is going to be ideal because users have various degrees of experience. That's a very sticky problem. A tough problem. You don't want the expert system to be so simple minded that an experienced analyst must go through 20 menus just to carry out a legal analysis. On the other hand, someone with limited experience needs the extensive guidance that 20 menus would give. Display 35 shows the general content of the expert system component of DATAPLOT. -83- SUBSYSTEM OUTPUT (THE EXPERT SYSTEM) SEQUENCE OF MENUS DATAPLOT COMMANDS CAUTIONS/CONCERNS MENU EXPLANATIONS ADDITIONAL TESTS RIGOROUS STATISTICAL CONCLUSIONS Display 35. Sequence of menus: Each menu should have guidelines at the bottom of the menu explaining not only the current menu but offering suggestions as to which menus to, select next for specific analyses and why. These suggestions can include specific DATAPLOT commands. Moreover, within the menu environment the cautions and concerns about the form of the analysis should be displayed clearly (e.g., a caution about the data not following a Normal distribution). The user should always have access to HELP functions for each menu These menu explanations should include a description of where the particular menu fits within the entire collection of menus. 
Additional tests: I mentioned the idea of performing statistical computations behind the scene. Although the analyst may be unaware of their specifics, he may want to make use of their results at a later time. The expert system is aware of this and can provide them. DATAPLOT thus has 2 expert subsystems: A consultant-style expert system which offers expert guidance for thoroughly and rigorously carrying out a data analysis; and a data-interpretive expert system which chooses a test, applies the test to the data, interprets the output, and formulates a rigorous statistical conclusion (couched in proper statistical terminology). The remaining displays provide some idea of the analyst's interaction with the expert system component of DATAPLOT. As you can see the need for a great variety of interactions in a large expert system requires a lot of thought and a large comprehensive software system. If any of you want to see DATAPLOT in operation, we are out in Gaithersburg, and we will be glad to come out and demonstrate it locally. -84- References Filliben, J. J. and Fong, J. T. (1984), "DATAPLOT as an Expert System for Data Analysis," available from American Society of Mechanical Engineers, June, 1984. Hahn, Gerald J. (1985), "More intelligent Statistical Software and Statistical Expert Systems: Future Directions," The American Statistician February, 1985. EDITING AND IMPUTATION Brian Greenberg, Bureau of the Census In talking today about an application of expert-system methods to data editing and imputation, it will be the first time that I use the words "expert system" in describing the edit and imputation program we have developed -- SPEER (Structured Program for Economic Editing and Referrals). In the past, the focus was more on describing the underlying methodology and discussing what the edit and imputation system could do for users. While preparing notes for this talk, I found that the emphasis was less on SPEER itself and more on editing and imputation as an expert system in principle. When work started on our project to develop edit and imputation software we had no intention of building an expert system. The goal was to develop techniques that corrected - survey and census response data and imputed for missing values. Looking back, one can see that as work on, this project proceeded, an expert system was evolving; and in the talk I will describe some of the steps in the development of this system. The purpose of editing and imputation is two-fold. First, if a respondent form is received and some responses are blank (item non- response), one tries to fill in missing values in order to create a complete data record for tabulation purposes. In addition, one wants to detect-erroneous responses and correct them. For example, a response may indicate a fifty-acre farm with five million bushels of wheat or a twelve-year-old grandfather. Such problems do occur in the response data; they can be data entry errors or errors at the source. Which field does one adjust and what value should replace those selected for change? When confronted with large data sets such as one has in the Census Bureau and many other Federal agencies, an automated system is a necessity. For surveys dealing with similar types of data, one would like to have general programs to avoid continually having to reinvent the wheel. 
On the other hand, it is desirable that an edit and imputation program incorporate as much survey-specific information as is available, and one would like the survey-specific information to be exercised through a family of rules. In addition, one usually would like a mathematical model to ensure that rules are applied consistently and to assist in selecting from among rules. In particular, one wants to blend survey-specific information and mathematical procedures within a coherent framework. The expert-system model is a natural structure for this type of program.

-85-

FUNCTIONS IN AN EDIT AND IMPUTATION SYSTEM
EDIT CHECKING
ERROR LOCALIZATION
IMPUTATION

Display 36.

What are the functions in an editing and imputation system? (See Display 36.) The first is edit checking. Edits are rules that detect prohibited response combinations, and it is easy to check when an edit fails, that is, when a prohibited combination is encountered. Given an edit-failing record, one endeavors to change as few responses as possible in order to make the remaining responses consistent. Determining which fields to change is called error localization. Finally, one wants to impute in order to allocate values for non-response and to replace responses deleted during the error localization process.

DESIGNING AN EXPERT SYSTEM FOR EDIT AND IMPUTATION
UNDERSTANDING THEORETICAL ASPECTS OF EDITING AND IMPUTATION
UNDERSTANDING FACETS AND NATURE OF SUBJECT-MATTER EXPERTISE

Display 37.

In designing edit and imputation software along the lines of an expert system, each function described above should be structured in its own module (see Display 37). In a general system one wants to enter as parameters the information that will be requested of all users. Survey-specific information, particularly decision rules, can be entered in specified, well-defined places throughout the program. These rules will be different for each user. SPEER has been employed on six segments of the 1982 Economic Censuses. The edit-checking routine never changed from user to user, and the error-localization subroutine was always the same. The imputation rules, however, varied in each application.

How does one impute? In general, one must rely on those with expertise about the particular survey vehicle. One works with the subject-matter specialists to elicit well-defined decision rules based on their knowledge and experience. What does one have to do in designing edit and imputation programs along the lines of an expert system? First, one must understand the nature and the facets of the subject-matter expertise: What do the experts know? Their experience concerning the survey vehicle is extensive; it is often based on experience in the analysis of response forms and familiarity with respondents. They are knowledgeable about the survey target population, the survey form itself, and often the source of errors or non-response in data.

-86-

As a matter of fact, for some kinds of missing data, the survey specialist can tell you why it's missing. For example, it may be known by people working on a survey that when a certain field is blank, the respondent means zero -- they just routinely skip the question. Other blank fields will never be zero. The respondent either did not know the answer to that question or did not want to reveal it, and so that data field was left blank. Knowledge of this sort is certainly survey-specific.
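A minimal sketch of how such survey-specific knowledge might be written down as data rather than buried in code follows. The field names and the rules themselves are hypothetical examples in Python, not SPEER's actual rule base.

    # A sketch of survey-specific knowledge recorded as data. The field
    # names and rules are hypothetical, not SPEER's actual rule base.
    #
    # Interpretation of a blank response, by field:
    #   "zero"    -- respondents routinely skip the item when the answer is zero
    #   "missing" -- a blank is genuine nonresponse and must be imputed later
    BLANK_RULES = {
        "NUMBER_OF_SEASONAL_EMPLOYEES": "zero",
        "ANNUAL_PAYROLL": "missing",
    }

    def resolve_blanks(record):
        """Apply the survey-specific blank rules; return fields still to impute."""
        still_missing = []
        for field, rule in BLANK_RULES.items():
            if record.get(field) is None:
                if rule == "zero":
                    record[field] = 0            # blank means the respondent meant zero
                else:
                    still_missing.append(field)  # leave for the imputation module
        return still_missing

    record = {"NUMBER_OF_SEASONAL_EMPLOYEES": None, "ANNUAL_PAYROLL": None}
    print(resolve_blanks(record), record)
    # ['ANNUAL_PAYROLL'] {'NUMBER_OF_SEASONAL_EMPLOYEES': 0, 'ANNUAL_PAYROLL': None}

Because the rules sit in one well-defined place, a different survey can supply a different table without touching the routine that applies it, which is the modularity emphasized here.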
Knowledge of that sort cannot be gleaned through standard analysis of reported data, nor are there usually auxiliary data sets available to design models of "missingness." The subject-matter specialist, however, is a source of information that can be profitably utilized. Statistically derived procedures (such as appropriate model-based imputation techniques) can be viewed and utilized as survey-specific decision rules.

In addition to subject-matter expertise, one must incorporate appropriate editing models. In SPEER, the error-localization process is basically a set-covering problem -- a mathematical model. One utilizes linear analysis and graph theory to select fields to delete on edit-failing records. Once these fields are deleted, the remaining fields will be mutually consistent, and then one can begin to impute. The process of imputation uses survey-specific rules provided by subject-matter experts. The knowledge base of decision rules can be organized within coherent imputation modules through which they can be applied. That is, the system goes back and forth between the subject-matter information and the mathematical model.

Mathematical techniques and subject-based imputation rules are two components that one should have in an overall edit and imputation system. Thinking of it that way, the mathematical procedure and the subject-matter rules can be treated as separate. One can extend the mathematical methods and revise the flow of the system as a whole, unencumbered by survey-specific considerations. The survey-specific rules can be examined in their own right, updated and revised as needed, independent of the programs through which they are applied. On the other hand, the mathematical procedures and decision rules are integrated. The mathematical constructions provide a framework to assist in choosing the most appropriate decision rule and to ensure that the value imputed will pass all applicable edits. Thus, an expert system for imputation should do more than provide a vehicle for accessing expert rules. It should also provide a mathematical framework to help decide from among the rules, choosing only rules which are valid for the record under consideration.

SPEER
CONTINUOUS (ECONOMIC) DATA UNDER RATIO EDITS (A(1), ..., A(N))
TYPICAL RATIO EDIT: L(ij) < A(i)/A(j) < U(ij)

Display 38.

-87-

SPEER (Display 38) is an edit and imputation system designed along the lines of an expert system. SPEER was designed for economic data such as wages, assets, inventories, etc. The typical edit is a ratio of two fields, called a ratio edit. The total salaries paid to employees divided by the total number of employees should be within some reasonable range consistent with our knowledge of the industry and occupation. The amount of crop yield divided by the number of acres should be in a certain range. Ratio range checks are a very common edit in economic surveys.

Given that a family of ratio edits is failed by a response record, one must select a set of fields to delete. We illustrate the workings of the error-localization mechanism of SPEER on two examples below. Let the circled numbers in Display 39 represent response fields and the edges in the graph represent edit failures between the adjoining fields. For example, the value in field 6 is inconsistent with fields 1 through 5 as determined by the collection of edit rules. If we delete the value in field 6 -- that is, remove node 6 from our graph -- all edges vanish.
Thus, the remaining fields are mutually consistent: because there are no edges connecting the corresponding nodes, there are no edit failures between them. That is, we can delete a single field to eliminate all edit failures. The lower graph is a little more complicated. One can see, however, that by deleting the values in fields 2 and 3, the remaining fields are mutually consistent. These simple examples capture the spirit of the error-localization methods built into SPEER -- a little graph theory is used to find the minimal subset of field values to delete.

After error localization one has a collection of blank fields (some due to non-response, others because fields were deleted during the error localization process). The remaining fields are consistent with one another, and they must be consistent with the values imputed. The program sets up a series of range specifications for a blank field, taking into account the value of each valid field.

-88-

[GRAPHIC] \WP1489.GIF

Display 39.

If A(n) is missing, and A(j) is consistent for all j ≠ n, then L(nj) < A(n)/A(j) < U(nj), so L(nj)A(j) < A(n) < U(nj)A(j).

Display 40.

Every valid imputation for a missing field (field n in Display 40) must lie in the overlap of the regions determined by each fixed field on the record in order to be consistent with every other field. Once the feasible region for a missing field is computed, the program reaches into the imputation module for the value to be imputed. The first applicable rule is selected and an imputation is derived based on this rule. If the derived value falls in the feasible region, it is accepted as a valid imputation. If not, the second rule is accessed, an imputation value is derived, etc. The value ultimately selected as the imputation will be chosen based on subject-matter-based rules and will also be consistent with all other fields on the record under review, because it is forced to lie in the feasible region.

This may be a good time to provide an example of what a rule sequence might look like. Suppose one is to impute for a field such as Annual Payroll (APR) on an economic census or survey. For concreteness, let us couch our discussion in terms of the 1982 Economic Censuses. The first rule might be to derive an imputation based on the 1982 Administrative Data value for APR. If the value derived does not lie in the feasible region, one might try the 1981 Administrative Data value for APR. If this value is not suitable, we pass to a third option, etc., until a valid imputation is derived.

Some imputation rules can be extremely field-specific. For example, suppose some field is to be reported in tons. Assume that the feasible region allows valid responses to be between 500 and 1,000 tons and the value 1,800,000 was reported and deleted as an error. The applicable option might be to divide the reported value by 2,000 (subject-based information that respondents sometimes report in pounds rather than tons). In this example, we would derive 900 tons and, observing that this value is feasible, accept it as the valid imputation. A common error in reporting economic data is that respondents provide answers in units rather than in thousands as per instructions. For fields in which this error may occur, the first rule (when appropriate) is to divide the reported response by 1,000.

The editing and imputation for the 1982 Enterprise Summary Report and the 1982 Auxiliary Establishment Report (both portions of the 1982 Economic Censuses) was performed using SPEER.
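The following is a compressed sketch, in Python, of the mechanics just described: ratio edits, error localization on the edit-failure graph, a feasible region for each blank field, and an ordered walk through imputation options. It is not SPEER itself -- SPEER poses error localization as a set-covering problem solved with linear analysis and graph theory, while the sketch simply deletes the field involved in the most failed edits -- and all bounds, field names, and candidate values are made up.

    # A compressed, illustrative sketch of SPEER-style processing.
    from itertools import combinations

    FIELDS = ["APR", "EMP", "INV"]          # annual payroll, employees, inventories
    BOUNDS = {                              # L(i,j) < A(i)/A(j) < U(i,j)
        ("APR", "EMP"): (10.0, 40.0),
        ("APR", "INV"): (0.5, 5.0),
        ("EMP", "INV"): (0.01, 0.5),
    }

    def failures(record):
        """Edges of the edit-failure graph: pairs of fields whose ratio edit fails."""
        bad = []
        for i, j in combinations(FIELDS, 2):
            if record.get(i) is None or record.get(j) is None:
                continue
            low, high = BOUNDS[(i, j)]
            if not (low < record[i] / record[j] < high):
                bad.append((i, j))
        return bad

    def localize(record):
        """Greedy error localization: blank out the field in the most failed edits."""
        while failures(record):
            edges = failures(record)
            counts = {f: sum(f in e for e in edges) for f in FIELDS}
            record[max(counts, key=counts.get)] = None
        return record

    def feasible_region(record, field):
        """Intersection of L(n,j)*A(j) < A(n) < U(n,j)*A(j) over the valid fields j."""
        low, high = 0.0, float("inf")
        for (i, j), (l, u) in BOUNDS.items():
            if i == field and record.get(j) is not None:
                low, high = max(low, l * record[j]), min(high, u * record[j])
            elif j == field and record.get(i) is not None:
                low, high = max(low, record[i] / u), min(high, record[i] / l)
        return low, high

    def impute(record, field, options):
        """Accept the first candidate option that falls inside the feasible region."""
        low, high = feasible_region(record, field)
        for label, value in options:
            if value is not None and low < value < high:
                record[field] = value
                return label, value
        return None, None

    record = {"APR": 500_000, "EMP": 200, "INV": 1_000}   # APR fails both of its edits
    localize(record)                                      # deletes APR
    options = [("1982 administrative data", 6_000),       # outside (2000, 5000): rejected
               ("1981 administrative data", 4_500)]       # inside: accepted
    print(impute(record, "APR", options), record)
    # ('1981 administrative data', 4500) {'APR': 4500, 'EMP': 200, 'INV': 1000}

The real system also supports the interactive analyst review described below, in which a person can override these rules; the sketch stops at the batch behavior.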
SPEER was also used to process the Manufacturing, Retail, Wholesale, and Service segments of the 1982 Economic Censuses of Puerto Rico. In each of these applications, the edit-checking and error-localization routines and the basic system flow are the same. Each application, however, had its own family of decision rules for imputation. Each application employed different rules based on the survey-specific fields, the relations between fields, and auxiliary information.

-90-

How does one implement an edit and imputation system based on expert-system principles? For a given application, one starts with the experts, studies their expertise, elicits rules, and embeds those rules in the system components requiring them. Sample data are tested, performance is evaluated, and rules are refined as needed.

Editing of economic data records at the Census Bureau is a two-phase process. All records are run through an automated edit and imputation system in batch mode. Within the automated routines, selected records are targeted as referral cases and are directed for analyst review. An optimal strategy will include automated procedures to resolve the majority of cases and individual review for establishments needing special handling. Typical referral criteria are: (1) a large change to reported data; (2) imputations for large establishments; and (3) a highly atypical combination of responses. The analyst reviewing a response form is a subject-matter specialist, and the review is currently a pencil-and-paper process. After analyst adjustments are made to the results of automated processing on an establishment record, the revised form is once again processed through the automated system.

SPEER allows on-line, interactive processing of referral cases. Used in this mode, the system converses with an expert using it. The human expert can override the decision rules residing in the system and replace them based on his/her expertise and auxiliary information about the case under review. Using this system, the analyst requests a specific record and reviews the processing done by the automated system. The analyst has the original response form and hence access to information not incorporated into the rules. Based on this additional information and his/her own experience, an analyst may overrule the decision rules built into the automated system.

IMPUTATION OPTIONS FOR APR
A. RANGE OF APR: (250, 750)
B. CURRENT VALUE: 375
OPTIONS
1. REPORTED VALUE: 82
2. 1982 ADMINISTRATIVE DATA:
3. 1981 ADMINISTRATIVE DATA BASED:
4. 1977 CENSUS DATA BASED:
5. IMPUTATION AND TOLERANCE:
6. ANALYST SUPPLIED VALUE:

Display 41.

The display seen by the analyst looks something like Display 41. Using Annual Payroll (APR), this display shows an acceptable range for APR from 250 to 750 (i.e., the feasible region). The current value is 375, which was derived by the automated system. The next value is the actual reported value of 82, followed by the reported 1982 Administrative Data and other candidate imputations based on 1981 Administrative Data, 1977 Economic Census Data, etc. The ordering above reflects the order in which the rule options are applied. With the requirement that the imputed value fall in a range consistent with all other fields, plus this variety of options, the analyst has a significant amount of information at his/her disposal to assist in the decision-making process.
If there is reason to believe that the most appropriate imputation value lies outside the feasible region (for example, because of explanatory notes on the form or through a call-back to the respondent), the analyst can select an imputed value outside the feasible region. A revised imputation for field APR is decided, and the analyst enters it into the data record. This value is accepted by the program, and field APR is, considered to be completed. Suppose there is a second field to be reviewed on this record (for example, Number of Employees (EMP)). Once again, the program displays on the terminal screen the feasible region for EMP, currently residing value, and candidate values for imputes derived according to each option, as it did for APR. Note, however, that each of these values is based, in part, on the new value of APR just entered by the analyst. As above, the analyst will determine an appropriate value for EMP, enter this value, and move on to the next field, if any. After all fields have been examined and adjusted if needed, the review is complete. The revised record will be consistent, and no further batch processing will. be required. The important observation from the perspective of an expert system, is that a true expert, converses with the automated expert programs in order to augment the system expertise and override decision rules as needed. Initial testing has shown that analysts have found this system easy to use. It has the potential for making their decisions in the review of establishment records less tenuous than is currently the case. Because the individual review of establishment records is a time-consuming and costly process, one can anticipate savings of time and money in the use of an "expert-system aided," on-line, interactive review process. The on-line, interactive portion of this program has not yet been put to use for actual survey processing. We are actively working with potential users to incorporate this aspect of the program in future editing and imputation processing. In summary, an edit and imputation system should blend statistical and subject-matter expertise in a coherent framework and integrate edit constraints with imputation strategy. We have described a structured system that attempts to meet these requirements and is sufficiently flexible to accommodate a variety of users. Development work continues on this system, enhancements are being made, and additional users are being identified. The references provide more information about some of the technical features of the SPEER system. References Greenberg, Brian (1981), "Developing an Edit System for Industry Statistics," Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, 11-16, Springer-Verlag, New York. __________ (1982), "Using an Editing System to Develop Editing Specifications, " Proceedings of the Section on Survey Research Methods, American Statistical Association, 366-371. and Surdi, Rita (1984), "A Flexible and Interactive Edit and Imputation System for Ratio Edits," Proceedings of the Section on Survey Research Methods, American Statistical Association, 421-426. -92- DISCUSSION Mark Winer, Office of Management and Budget When Terry first mentioned the idea of having a panel on expert systems in a conference on Statistical Uses of Microcomputers in Federal Agencies, my question was "What do expert systems have to do with microcomputers and statistical systems?." 
I decided I would take a look at all the things I saw today to see how this session fits into the other sessions. The first thing we see is that you can use the IBM PC, or other personal computers, as a terminal to a mainframe. The mainframes have excellent software systems. That allows you to use both machines for the things those machines are best at. With the large amounts of data and large amounts of information you might need with an expert system, it is good to use a mainframe for most cases; but for the kind of processing and quick response you might want it's nice to use downloaded results from an expert system on a personal computer. The second reason this fit in is that, as I mentioned, the system developed by Brian Greenberg has just been adapted to personal computers. As memory capacity and storage capacity on personal computers increase, even the large systems like DATAPLOT could be extended to personal computers. The third reason that this fit in is that every couple of years there is a real hot topic in the computer field. In 1982 and 1983 it was decision support systems. If we were having this workshop in 1982 or 1983, we would have undoubtedly had a panel on decision support systems; and since in 1984 and 1985 the hot topic is expert systems, we are having this workshop and it is incumbent upon you to have a panel on expert systems. This brings up the obvious question of whether an expert system really is something new or is it just something old, another big word that people use to bring out high- priced consultants to design your system. I guess I will say that from what we have seen in these demonstrations today, expert systems are doing more than ordinary software systems. Ordinary software systems help the user do the things he normally does but make it possible for him to do those things faster and save him some of the tedious parts of the task. Both the systems we heard talked about today have the advantage of actually bringing additional knowledge to the user in that he can do what he wouldn't necessarily know how to. Expert systems show you how to locate the error that is the easier error to change if you are trying to do efficient editing. You have subject-matter expertise that analysts couldn't produce on their own. This system does that for you. It uses the subject matter of the expert to figure out how that record can be changed with some help from the system. The DATAPLOT system teaches you about the tests that are available to you as you use them. It suggests to you additional steps to do as you are doing things; so even if you are not an expert statistician, you can figure out the ways to proceed as you are, working on a problem. So, as I say, expert systems provide something beyond what we ordinarily have in software systems. They are an extension of the existing packages rather than things that stand by themselves. -93- QUESTIONS AND ANSWERS Ql: You mentioned a new fad. Isn't some of this just like a sophisticated help facility? Al: (Mr. Winer): Yes. Q2: This is this years new thing, but better help facilities have been a growing need since computers got started. I think expert syst- ems are a logical outgrowth of that. A2 (Mr. Lawton): I think expert systems are more than that, but help facilities are at least part of what we are dealing with here. I would say what makes the help facility more sophisticated is that they have some expertise built into them, that is based on knowledge of where in the program the user calls the help facility. 
So what you say is partly true, but I think there is some intelligence built into the back of the help facility in the expert system that wouldn't be there in a more conventional system or from just reading, the help file. A2 (Mr. Ireland): There probably are two other issues. First, the help system can be changed incrementally as you come to understand what kind of help you need. The idea of rules makes it easy to develop small help modules that are added to a system that already has a help facility. Second, for some of Brian's things, it isn't a user help facility, but a specification of how to handle a particular piece of data. So, the help facility might never be seen by analysts unless they ask to see it, but it would be used to make a proper kind of modification to the data. A2 (Mr. Greenberg): Expert systems can be run in batch mode once ex- pertise is built into it, and that bears on the use of the help facility in a batch mode. Q3: I am curious about some of the details of the DATAPLOT and the editing and imputation system. Let me start with the editing system since that is fresh in your memory. How much memory on the IBM XT does your system take up? A3 (Mr. Greenberg): I really don't know. I haven't been doing the actual transfer to the XT. Q4: But you could fit it into 64OK? A4 (Mr. Greenberg): Yes, plenty of data is on one floppy disk. Q5: You said it was easy to use. What did you say--a half day? How long? A5 (Mr. Greenberg): I would say a half day working with somebody like myself or someone familiar with it. -94- Q6: We do surveys a lot and they are typically tedious -- we have people coming in to do error checking and editing in a rather primitive way, so I think your product would be very useful to us and a lot of other people. What would be a good way for us to learn more about it? A6 (Mr. Greenberg): Drop me a line or give me a call. Q7: The degree of statistical information you obviously have in your head goes well beyond that of everyone in our office. The only fear I would have would be whether those of us who have a much lower level of statistical knowledge could still make use of DATAPLOT. What do you think about that? A7 (Mr. Filliben): That is a general problem, and one of the displays dealt with varying experience. This addresses the point of whether this is an extended help facility. We tried to make sure that the menus that came up would be a part of the education process too--a tutorial, if you will. We have had people use this expert system who have very limited statistical background, and they have come, out with good results. It's a matter of learning, and I think the expert systems are at the point now where it's nice to have a machine that has an expert system, but it's also nice to have some statisticians and other consultants around who can augment them. One thing we did not mention was references. Where does someone go if he really wants to read up on graphical or residual analysis, for example? That is one command as far as DATAPLOT is concerned. There should be a reference command. It's not in there yet, but there is a body of literature that's out there that has a lot of details. If people want to go in and fill up their own base knowledge, they should have access to this base. It is very much, as you say, an extended help. There are lots of different ways these various systems can be of help because there are a lot of different ways we can have deficiencies in our own knowledge. Q8: What kind of mainframe are they working with? A8 (Mr. 
Filliben): All the major mainframes. UNIVACS. one, and in fact the default machine, is the VAX 11/780. IBM/PC's. The Pentagon has it on a Honeywell Multics system. PERKIN-ELMERS. PRIMES. The only machine we had difficulty putting it up on was the CYBER machine. That problem will disappear because we are getting a CYBER machine and we will be forced to address that problem. They have a hardware restriction on memory. In UNIVAC you run into an overlay problem. In terms of whether it would download to a PC or could be put up on a PC, you would need about-a half a megabyte of memory. Small machines--micros--are expanding to the point where it's a real possibility to put DATAPLOT on a PC. Q9: You say that NTIS sells this? A9 (Mr. Filliben): National Technical Information Service sells it for $1200 --a one-time-only fee. You get the source code. The source code on the file is 12 megabytes, so you have to have somewhere where you can put it. Q10: Did you write this yourself? A10 (Mr. Filliben): Yes. -95- Q11: How long did it take? All (Mr. Filliben): We started back in 1972 with a software system called DATAPAK which is free from NBS. That sort of got us into the problem. By 1975 it became clear that interactive systems were becoming more important. By 1977 we had the first DATAPLOT running, and things have essentially been the same since then. We augmented it to include the expert system. Mr. Winer: Perhaps less a question than, a comment. At the end of the last session, I asked the panelists how their decentralized and spreadsheet-type statistical systems insure or assure data integrity and adherence to the statistical standards. Here I think we have had two presentations in which one could in a sense say "Hey, that answers the question!". If people start using systems like Brian's, they will have more data integrity; and if one starts using systems more like Jim.'s, one could have more adherence to agencies present standards. I would like to take this opportunity to thank Terry Ireland who chaired this session, but who is also the Chairman of the Subcommittee of the Federal Committee on Statistical Methodology who organized this entire workshop, including this session. We thank him and thank you all for coming. -96- Appendix Announcement of Workshop on Statistical Uses of Microcomputers in Federal Agencies The Subcommittee on Statistical Uses of Microcomputers in Federal Agencies of the Federal Committee on Statistical Methodology is sponsoring a one-day workshop on April 24, 1985, to discuss with other Federal employees selected topics on statistical uses of microcomputers. The workshop will be held at the IRS Auditorium, 1111 Constitution Avenue, N.W., 7th floor, from 9:15 a.m. to 4:30 p.m. The agenda and speakers are as follows: 9:15 a.m. WELCOME AND INTRODUCTION Chair: Maria Gonzalez, Office of Management and Budget Arrangements: Linda Taylor, Internal Revenue Service 9:20 a.m. PLANNING Chair and Discussant: Larry Cox, Bureau of the Census Rapporteur: Fred Cavanaugh,.Bureau of the Census Microcomputer technology has much to offer statistics, and many statisticians have become microcomputer users at work and at home. This technology and the keen interest of statisticians in it provide statistical agencies with many opportunities, each bringing with it responsibilities for planning, implementation and evaluation: if every statistician (programmer/secretary) in an agency wants a microcomputer, who should have them? For what purposes can/should microcomputers be used? In what configuration? 
At what cost (overall/per user)? How will this technology coexist with central ADP services? What policy decisions need to be made -when -- by whom? In this session on planning, we will explore such questions through discussion, focusing on three different and successful approaches to these problems -- those adopted by the Census Bureau, the National Security Agency and the Bureau of Labor Statistics. Speakers: Ronald R. Swank, Bureau of the Census Kathy Schnaubelt, National Security Agency Peter Stevens, Bureau of Labor Statistics DISCUSSION 10:45 a.m. COFFEE BREAK -97- 11:00 a.m. ELECTRONIC DATA DISSEMINATION Chair: Ken Berkman, Bureau of Economic Analysis Rapporteur: Jay Casselberry, Energy Information Agency This session is a panel discussion of the different approaches to.electronic data dissemination by various Federal agencies. Different approaches will be described with particular emphasis on the factors determining an agency's approach to data dissemination and the problems encountered in their implementation. The experience gained by these agencies will be presented: National Technical Information Service (NTIS) distribution of microcomputer floppy disks; Census CENDATA system; and the U. S. Department of Agriculture's current development of an on-line system. Speakers: Stuart Weisman, National Technical Information Service Barbara Aldrich, Bureau of the Census Roxanne Williams, Department of Agriculture ***DISCUSSION*** 12:30 p.m. LUNCH 1:15 P.M. APPLICATIONS OF MICROCOMPUTERS Chair: Ron Steele, Department of Agriculture Rapporteur: Tom Nagle, Internal Revenue Service This session is a panel discussion of statistical applications of microcomputers. The utility and weaknesses of applications software and operating systems will be discussed. Some examples involve interfacing mainframe and microcomputers. Issues to be addressed include an assessment of the utility of microcomputers at present, the future utility in light of new hardware and software technologies, and considerations regarding data integrity, security and accessibility. Speakers: Linda Atkinson, Department of Agriculture Gary Nelson, Department of Agriculture Rick Hayes, Internal Revenue Service Brian Carney, Department of Agriculture Paul Dobbins, Department of the Treasury Dick Shively, Department of Agriculture. -98- DISCUSSION 2:45 p.m. COFFEE BREAK 3:00 p.m. EXPERT SYSTEMS Chair: Terry Ireland, National Security Agency Discussant: Mark Winer, Office of Management and Budget Rapporteur: Norman Glick, National Security Agency Recently, the idea of incorporating techniques used by professional experts into software has become popular. This session will introduce the basis for expert-system methodology and give several practical examples of expert systems with statistical applications that are currently in use. Speakers: George Lawton, Army Research Institute James Fillben, National Bureau of Standards Brian Greenberg, Bureau of the Census ****DISCUSSION**** 4:30 p.m. ****ADJOURN**** -99- Reports Available in the Statistical Policy working Paper series 1. Report on Statistics for Allocation of Funds (Available through NTIS Document Sales, PB86-211521/AS) 2. Report on Statistical Disclosure and Disclosure-Avoidance Techniques (Available through NTIS Document Sales, PB86211539/AS) 3. An Error Profile: Employment as Measured by the Current Population survey (Available through NTIS Document Sales PB86- 214269/AS) 4. 
Glossary of Nonsampling Error Terms: An Illustration of a Semantic Problem in Statistics (Available through NTIS Document Sales, PB86-211547/AS) 5. Report on Exact and Statistical Matching Techniques (Available through NTIS Document Sales, PB86-215829/AS) 6. Report on Statistical Uses of Administrative Records (Available through NTIS Document Sales, PB86-214285/AS) 7. An Interagency Review of Time-series Revision Policies (Available through NTIS Document Sales, PB86-232451/AS) 8. Statistical Interagency Agreements (Available through NTIS Document Sales, PB86-230570/AS) 9. Contracting for Surveys (Available through NTIS Document Sales, PB83-233148) 10. Approaches to Developing Questionnaires (Available through NTIS Document Sales, PB84-105055/AS) 11. A Review of Industry Coding Systems (Available through NTIS Document Sales, PB84-135276) 12. The Role of Telephone Data Collection in Federal Statistics (Available through NTIS Document Sales, PB85-105971) 13. Federal Longitudinal Surveys (Available through NTIS Document Sales, PB86-139730) 14. Workshop on Statistical Uses of Microcomputers in Federal Agencies (Available through NTIS Document Sales, PB87-166393) Copies of these working papers may be ordered from NTIS Document Sales, 5285 Port Royal Road, Springfield, VA 22161, (703) 487-4650.