Toward More Effective Testing for High Assurance Systems



Herbert Hecht and Myron Hecht
SoHaR Incorporated
Beverly Hills, California

Dolores Wallace
National Institute of Standards and Technology
Gaithersburg, Maryland



Abstract

The objective of this paper is to reduce the cost of testing software in high assurance systems. It is at present a very expensive activity and one for which there are no generally accepted guidelines. A part of the problem is that failure mechanisms for software are not as readily understood as those for hardware, and that the experience of any one project does not provide enough data to improve the understanding. A more comprehensive attack on the high cost of software test requires pooling of fault and failure data from many projects, and an initiative by the National Institute of Standards and Technology (NIST) that can furnish the basis for the data collection and analysis is described.

Introduction

Testing to establish the dependability of high assurance systems (systems in which certain failures can have disastrous consequences) is an expensive activity and one for which there are no generally accepted guidelines. Failures in these systems are rare, and therefore a given project will not have enough data for assessing development or test methodologies or for establishing objective test termination criteria. This paper does not claim to overcome all of these difficulties, but it aims to point the way toward much more effective testing of such systems.

We show that current software test techniques are not well suited to deal with the failure mechanisms that typically prevail in high assurance systems. The direction toward a more targeted approach can be suggested on the basis of current information, but much more data, collected from a broad spectrum of applications, will be required to formulate sound policies and to generate standards. For this reason the major message of this paper is an invitation to participate in a data collection effort that is being started under the auspices of the National Institute of Standards and Technology (NIST). The current status of this effort, including the protection of privacy of contributed data, is described.

Where and How Software Fails

Software failure and related terms are used here as a short form for "failures in systems controlled by software". Failures in high assurance systems are due to the coincidence of two circumstances: a deficiency in the code, referred to as a software fault, and the occurrence of a trigger, a data or computer state that causes execution of the fault to result in a failure. The role of the trigger is particularly important for the testing of high assurance systems, in which the 'first time around' failures have already been eliminated.

Hardware failure mechanisms are generally well understood, and that has permitted the development of process controls to produce reliable components, of test techniques for culling parts of questionable quality, and of fault tolerance provisions to deal with the consequences of residual failures. Our knowledge about software failure mechanisms in high assurance systems is much more limited because of the small number of failures seen on any single project. Software reliability is now frequently the limiting factor in the application of computers in critical systems, as evidenced by the agenda of practically every meeting and workshop on fault tolerant and high integrity computing. Among twenty paper sessions at the 1995 (25th) International Symposium on Fault Tolerant Computing, twelve were concerned with specific software reliability issues, and an additional six dealt with architectural and communications issues that were also largely software related [FTCS95].

Faulty software handling of hardware exception conditions was recognized as a major cause of failures in a critical computer system many years ago [VELA84]. Since then, studies have explicitly pointed to the infrequently executed sections of code as the most likely ones to contain faults after initial debugging has been completed [KANO87, HECH94]. The first of these studies concerned a French telephone switch that contained 'telephony' (switching functions) and 'defense' (exception handling) code. The defense code was 20% smaller than the telephony code, but it contributed about 2.5 times more to the failures that brought the system down. The subject of the second study was a version of the NASA/JPL Deep Space Network (DSN) software, the results of which are summarized in Table 1. Project personnel identified five segments each of frequently executed (FC) and rarely executed (RC) code. The latter category included exception handling, redundancy management, initialization, and calibration routines.

Table 1. Comparison of Frequently and Rarely Used Code

Characteristic                               FC        RC
Program size, KSLOC                       185.1     144.3
No. of faults found in test                 893       135
Fault density from test                  0.0048    0.0016
Failures during first year of operation      33        42
Failures during last 4 months of year         9        32


The fault density computed from test results indicates that the rarely executed segments had fewer faults than the frequently executed ones, but in operation the order reversed, even though the frequently executed segments accumulated far more execution time. As might be expected, the failures in the frequently executed code occurred earlier during the operational period; the rarely executed code took much longer to be debugged and probably still contained many residual faults at the end of the first year. At least in this case, the test strategy did not provide high coverage for the rarely executed segments. A more suitable test approach is suggested in [LI94].

If faults that remain after test and initial operation are concentrated in rarely executed code, it follows that the triggers are rare events that cause these segments to be executed under conditions that differ from those anticipated by the developers. The emphasis now shifts from code segments that are infrequently executed to the specific rare events that produce the failures and that can be studied in order to produce better test cases. A study of Space Shuttle Avionics (SSA) software failures [HECH94] clearly shows that rare events are the predominant cause of serious failures in well-tested software. In that project the consequences of failure (severity) were classified on a scale of 1 to 5, where 1 represents safety critical failures and 2 mission critical failures, with higher numbers indicating successively less mission impact. Rare events were the exclusive cause of identifiable failures in Category 1 (safety critical). In all safety and mission critical categories, rare event failures were overwhelmingly due to multiple rare events, typically two: among reports that listed any rare event among the causes of failure, the average number of rare events identified was 1.95. At this point it is appropriate to ask, "How often have we concentrated on test cases that involved more than one rare event at a time?" The thoroughness of final testing in the shuttle program surfaced weaknesses which, in most other situations, would probably have been detected only after they caused operational failures. The study also showed that failures in Categories 4 and 5 are not primarily due to rare events, and that in Category 3 the importance of rare events is much less than in the more critical categories. One explanation may be that program segments which cannot cause critical failures are tested less thoroughly and therefore arrive at the final test still containing some of the faults that cause failures under non-rare conditions.
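This shift in emphasis suggests a concrete test design tactic: systematically enumerate combinations of rare conditions rather than exercising them one at a time. The sketch below, in Python, is only an illustration of that idea; the rare-condition names are hypothetical and are not drawn from the shuttle study.

    # Illustrative sketch only: enumerate test scenarios that combine two
    # rare conditions at a time, since the failures above were typically
    # triggered by multiple rare events. The condition names are hypothetical.
    from itertools import combinations

    rare_conditions = [
        "sensor_dropout",
        "bus_timeout",
        "redundant_unit_switchover",
        "counter_rollover",
        "out_of_range_input",
    ]

    # All unordered pairs: C(5, 2) = 10 candidate scenarios.
    paired_scenarios = list(combinations(rare_conditions, 2))

    for scenario in paired_scenarios:
        # A real campaign would map each pair to concrete input data and
        # injected machine state; here the combinations are only listed.
        print("test scenario:", " + ".join(scenario))

In practice each pair would be mapped to concrete input data and injected machine state, and triples could be added for the most critical functions.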

One further indication of the failure potential when a program encounters more than one rare condition comes from a research program sponsored by the NASA Langley Research Center to investigate the benefits of N-version programming [ECKH91]. Twenty versions of a redundancy management program, written in Pascal, were developed at four universities (five versions at each) from the same requirements, and the versions were then tested individually to establish the probability of correlated errors that would defeat the benefits of N-version fault tolerant software. The specifications for the program were very carefully prepared and then independently validated to avoid introducing common causes of failure. Each programming team submitted its program only after it had tested it and was satisfied that it was correct. All 20 versions were then subjected to an intensive third-party test program. The objective of each program was to furnish an orthogonal acceleration vector from the output of a non-orthogonal array of six accelerometers after up to three arbitrary accelerometers had failed. Table 2 shows the results of the third-party test runs in which an accelerometer failure was simulated. The software failure statistics presented below were computed from Table 1 of the reference.

Table 2. Tests of Redundancy Management Software

No. of prior anomalies    Observed failures    Total tests    Failure fraction
0                                     1,268        134,135                0.01
1                                    12,921        101,151                0.13
2                                    83,022        143,509                0.58

The number of rare conditions (anomalies) responsible for a failure was one more than the entry in the first column (because an accelerometer anomaly was simulated during the test run, and it is assumed that the software failure occurred in response to the added anomaly). In slightly over 99% of all tests a single rare event (accelerometer anomaly) could be handled, as indicated by the first row of the table. Two rare events increased the failure fraction by more than a factor of ten, and the majority of test cases involving three rare events resulted in failure. A significant conclusion from this work is that test cases containing multiple rare conditions greatly increase the probability of finding latent faults, including those not due to the multiplicity of conditions.
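For clarity, the failure fraction in Table 2 is simply the ratio of observed failures to total tests in each row; for the case of one prior anomaly, for example,

\[ \text{failure fraction} = \frac{\text{observed failures}}{\text{total tests}} = \frac{12{,}921}{101{,}151} \approx 0.13 . \]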

Prevailing Test Practices

The two classical test strategies that are still very much in use today are functional (requirements based) and structural (code based) testing. Random testing can be considered a variant of either of these, depending on whether the random selection is among requirements or code segments. In an attempt to increase the "efficiency" of test, it has been suggested that test effort be allocated to segments in accordance with their expected operational frequency of invocation. Such a strategy results in minimal testing of exception handling and, if employed in a high assurance environment, can lead to disaster [LI94].
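A minimal sketch, with hypothetical operational-profile numbers, illustrates why frequency-based allocation starves exception handling: under a profile in which exception conditions arise in roughly 0.1% of invocations, a thousand profile-weighted test cases can be expected to exercise the exception path only about once, while uniform selection over the same input classes exercises it hundreds of times.

    # Minimal sketch with hypothetical profile numbers: allocating test
    # cases by expected operational frequency leaves exception handling
    # almost untested, while uniform selection exercises it heavily.
    import random

    random.seed(0)

    profile = {
        "routine_processing": 0.970,     # hypothetical operational profile
        "mildly_unusual_input": 0.029,
        "exception_handling": 0.001,
    }
    classes = list(profile)
    n_tests = 1000

    weighted = random.choices(classes, weights=list(profile.values()), k=n_tests)
    uniform = random.choices(classes, k=n_tests)

    for name in classes:
        print(f"{name:22s} profile-weighted: {weighted.count(name):4d}"
              f"   uniform: {uniform.count(name):4d}")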

Figure 1. Branch coverage as a function of test cycles

An interesting glimpse into test practices in such an environment is provided by Figure 1, which shows branch coverage as a function of test cycles from tests of plant systems under three representative test strategies: acceptance test (functional), plant simulation (direct manipulation of the input data), and uniform random selection (among program variables). All three strategies achieve about 50% coverage on the first test case. The acceptance test progresses very slowly until about cycle 30, rises to a new plateau near 75% coverage that persists to cycle 200, and then rises rapidly to coverage above 95%. A scenario consistent with this pattern is that the first 30 cycles were used for very routine test cases, that mildly unusual conditions were then input through cycle 200, and that thereafter the program was subjected to rare conditions that accessed previously unused branches.

The plant simulation initially produces higher coverage than the acceptance test, but then plateaus at about the 65% level and never goes higher. The explanation is that the plant simulator, though very capable as far as plant malfunctions are concerned, could not generate conditions representing computer or data link failures. The uniform random strategy yields steadily rising coverage from the beginning, reaching 95% after about 50 cycles. This strategy produced a mix of routine, mildly unusual, and very unusual test conditions, and some participants in the experiment concluded that these results show random testing to be superior to "systematic" test case generation [BISH90]. An alternative interpretation of the figure is that systematic approaches need to be much more aggressive in selecting data sets that challenge the program. One possible approach to identifying targets for more focused testing was described at HASE'96 [VOAS96], but that paper also recognizes that much work needs to be done to improve testing of high assurance systems. To that end, more information on what causes failures in critical programs is needed, and because such failures occur infrequently, data from multiple sources are required for a statistically valid analysis. This is the motivation for the NIST data repository effort described below.
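The coverage curves of Figure 1 come from instrumented test runs; the toy sketch below shows the kind of bookkeeping involved, with a hypothetical instrumented routine standing in for the plant software (it is not the experiment of [BISH90]). Cumulative branch coverage is recomputed after every test cycle, which is what produces the plateaus and jumps discussed above.

    # Toy bookkeeping for a coverage curve: a hypothetical instrumented
    # routine records every branch it takes, and cumulative branch coverage
    # is reported after each test cycle.
    import random

    ALL_BRANCHES = {"x_negative", "x_nonnegative", "x_overrange", "link_failure"}
    covered = set()          # branch identifiers reached so far

    def monitored(x, link_ok=True):
        if x < 0:
            covered.add("x_negative")
        else:
            covered.add("x_nonnegative")
        if x > 100:
            covered.add("x_overrange")    # rarely reached branch
        if not link_ok:
            covered.add("link_failure")   # exception-handling branch

    random.seed(1)
    for cycle in range(1, 21):
        # Uniform random inputs occasionally reach the rare branches.
        monitored(random.uniform(-10, 120), link_ok=random.random() > 0.05)
        print(f"cycle {cycle:2d}: cumulative branch coverage "
              f"{len(covered) / len(ALL_BRANCHES):.0%}")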

Framework of the NIST Program

The Standard Reference Materials Program at NIST assists science and industry in achieving meaningful measurements through the production, certification, and issuance of Standard Reference Materials (SRMs). The Information Technology Laboratory (ITL) will provide analogous reference data for software, along with tests and test methods to ensure a usable, scalable, interoperable, and secure information technology infrastructure. One of the primary goals of ITL is to assure that U.S. industry, academia, and government have access to accurate and reliable test methods, data, and reference materials. In particular, a newly formed project will collect, analyze, and disseminate data on software failures.

The needs of software researchers for project data and of software developers for tools to collect and analyze their test data are closely related to each other and to ITL's mission. Consequently, NIST has initiated a project called error, fault, and failure data collection and analysis (EFF). The purpose of the EFF project is to provide reference data on software errors, faults, and failures. Data from multiple projects are needed to develop such reference benchmarks and to provide researchers with sufficient sample sizes to develop new analytic methods. Projects and their sponsoring companies need similar data to understand where specific error types are likely to occur and how frequently they occur, as do developers of test tools. Developers may use the data to locate troublesome parts of their programs, adjust their development methods, and adapt their testing processes; both the project test schedule and product quality could benefit from reference profiles.

Scope of Data Collection and Analysis

The plan for EFF is aggressive; a primary risk is that few data will be contributed because of proprietary and privacy concerns. NIST's reputation for objectivity and its ability to protect proprietary information may help to overcome this reluctance. Further, EFF tools will provide a service to the contributors that will make collecting the data worthwhile. Another risk, normalizing data from diverse environments, is partially addressed through the Project Information file, which is described later.

The tools developed under these guidelines are accessible on the World-Wide Web (WWW) and can be ported to the contributing organization's server. There are separate data structures and collection tools for fault data and for failure data. Fault data are collected throughout the software life cycle, while failure data (which require execution) come from test or service. Once a failure is analyzed, it usually generates a fault record. The EFF Fault tool is ready, and the EFF Failure tool is being implemented. The tools are similar in design, but their content and services differ. To minimize contributor effort, much of the input is pre-defined through pull-down menus and radio buttons; text input is limited.
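The following sketch illustrates, with purely hypothetical field names rather than the actual EFF record layout, the relationship just described: a failure record produced during test or service is analyzed and then gives rise to a linked fault record.

    # Hypothetical sketch (not the actual EFF schema): a failure record
    # observed in test or service is analyzed and generates a fault record.
    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class FailureRecord:
        project: str
        failure_id: int
        observed_on: date
        phase: str                      # e.g. "test" or "service"
        description: str
        fault_id: Optional[int] = None  # filled in once the failure is analyzed

    @dataclass
    class FaultRecord:
        project: str
        fault_id: int
        fault_type: str                 # e.g. "exception handling"
        discovered_in: str              # life-cycle phase of discovery
        originating_failure: Optional[int] = None

    failure = FailureRecord("demo", 17, date(1997, 3, 4), "test",
                            "crash on simulated data link loss")
    fault = FaultRecord("demo", 301, "exception handling", "integration test",
                        originating_failure=failure.failure_id)
    failure.fault_id = fault.fault_id   # link the analyzed failure to its fault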

Once data are available, NIST will validate them, remove any organizational identification, and then make the data publicly available through another WWW system. If requested, the organization's contact will be provided a key to identify its own data within the EFF database. The NIST public data viewing and analysis tool is in design. Currently, information on the WWW page at / links to a taxonomy-based system called Reference Information for Software Quality (RISQ) [NIST97]. RISQ provides access to artifacts for software quality such as documents, tools, and code. The EFF data will be part of RISQ as a data artifact and will be accessible through an SQL/ORACLE system residing on the hissa server, which is open to the public. Contributed data will be sanitized on a non-publicly-accessible system before being entered into the ORACLE system. We will provide interfaces to publicly available tools for examining the data, and users will also be able to download the data and analyze them with their own tools.

Data collected by an organization remain available to that organization exclusively. The fault management capability allows user-constructed queries on the submitted fault data; the basic query structure supports sorting on open and resolved faults, and on the dates and effort for discovery and resolution, relative to any selection of fault attributes. The structure of the data collection and analysis tools is shown in Table 3.

Table 3. Functions and Structures of EFF Tools

Function                        Structure
Project Information             1 record, submitted once, may be edited
Fault Data                      1 record per fault
Failure Data                    1 record per failure
Fault management functions      selected by contributor


Within an organization, features like development processes, governing standards, programming language, and quality practices are understood. When data are collected across many projects and different organizations, these elements are likely to differ, as are the characteristics of the organizations. The Project Information file identifies these differences and permits normalizing data from many environments. Most fields in this file are concerned with the type of software and its characteristics. The Fault and Failure records contain three types of information: general (project name and fault or failure record number), descriptive parameters (type, when discovered), and data describing correction of the fault.
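As an illustration of the fault management queries mentioned earlier, the sketch below sorts hypothetical fault records by status, discovery date, and correction effort; the field names are illustrative only and do not reproduce the actual EFF data structure.

    # Illustrative only (field names are not the actual EFF layout): the
    # kind of user-constructed query the fault management functions support,
    # sorting on status, discovery date, and correction effort.
    faults = [
        {"project": "demo", "id": 1, "type": "logic",
         "discovered": "1997-01-10", "status": "resolved", "effort_hours": 6},
        {"project": "demo", "id": 2, "type": "exception handling",
         "discovered": "1997-02-02", "status": "open", "effort_hours": None},
        {"project": "demo", "id": 3, "type": "interface",
         "discovered": "1997-01-21", "status": "resolved", "effort_hours": 14},
    ]

    # Open faults, oldest first (ISO dates sort correctly as strings).
    open_faults = sorted((f for f in faults if f["status"] == "open"),
                         key=lambda f: f["discovered"])

    # Resolved faults ranked by the effort spent on correction.
    by_effort = sorted((f for f in faults if f["status"] == "resolved"),
                       key=lambda f: f["effort_hours"], reverse=True)

    print([f["id"] for f in open_faults])   # -> [2]
    print([f["id"] for f in by_effort])     # -> [3, 1]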

Conclusions

Better understanding of failures in software for high assurance applications will lead to more effective development and test methodologies and to better fault tolerance and mitigation, and it will broaden the application of computers in critical systems. There is much evidence that current test strategies are not well suited for uncovering residual faults in critical software. Presently available data indicate that more intensive testing of exception handling provisions will pay handsome dividends, but much more specific guidance could be developed if broader data were available. Relevant failures are infrequent in any given project, and therefore pooling of data is essential for a technically sound attack on the problem. NIST has undertaken to collect such data, to safeguard the privacy and proprietary concerns of the contributors, and to make the data, as well as tools for their analysis, accessible to industry and academia. This presents a real opportunity to advance our knowledge of software failure mechanisms and thereby to enhance the reliability and safety of computer applications in critical systems. It is hoped that organizations and individuals will realize that it is to their own and the community's benefit to contribute currently available and future data.

References

 

[BISH90] P. G. Bishop, ed., Dependability of critical computer systems 3 -- Techniques directory, Elsevier Applied Science, ISBN 1-85166-544-7, 1990

[DAHL90] G. Dahll, M. Barnes, and P. Bishop, "Software Diversity -- A way to Enhance Safety?", Proc. Second European Conference on Software Quality Assurance, Oslo, May 1990

[ECKH91] D. E. Eckhardt, A. K. Caglayan, J. C. Knight, et al., "An experimental evaluation of software redundancy as a strategy for improving reliability", IEEE Trans. Software Engineering, Vol. 17, No. 7, July 1991, pp. 692 - 702

[FTCS95] Digest of Papers, Twenty-fifth International Symposium on Fault-Tolerant Computing, Pasadena CA, June 1995, IEEE Computer Society Press

[HECH94] Herbert Hecht and Patrick Crane, "Rare Conditions and their Effect on Software Failures", Proceedings of the 1994 Reliability and Maintainability Symposium, pp. 334 - 337, January 1994

[HECH96] Herbert Hecht and Dolores Wallace, "Project Data to Support High Integrity Methods," Nuclear Plant Instrumentation, Control and Human Interface Technologies Conference, May 6-9, 1996, Pennsylvania State University, State College, PA.

[KANO87] K. Kanoun and T. Sabourin, "Software Dependability of a Telephone Switching System", Digest, FTCS-17, June 1987

[LI94] N. Li and Y. K. Malaiya, "On input profile selection for software testing", Proceedings of ISSRE 94, pp. 196-205

[NIST97] Charles B. Weinstock and Dolores R. Wallace, NISTIR 5954, "RISQ: A WWW-Based Tool for Referencing Information on Software Quality," U.S. Department of Commerce, Technology Administration, National Institute of Standards and Technology, January 1997.

[VELA84] P. Velardi and R. K. Iyer, "A Study of Software Failures and Recovery in the MVS Operating System", IEEE Trans. Computers, Special Issue on Fault Tolerant Computing, Vol. C-33, No. 7, July 1984

[VOAS96] J. Voas, F. Charron, and K. Miller, "Investigating Rare-Event Failure Tolerance: Reductions in Future Uncertainty", Proc. IEEE High-Assurance Systems Engineering Workshop (HASE'96), October 1996