NATIONAL ESTIMATES OF AN AUDITED VARIABLE WITH A LARGE NUMBER OF MISSING VALUES



Ellen Hertz, Ph.D.

National Highway Traffic Safety Administration



Abstract

The National Highway Traffic Safety Administration maintains the National Automotive Sampling System Crashworthiness Data System. The NASS CDS is a stratified probability sample of all crashes involving a passenger vehicle (passenger car, light truck or van or sports utility vehicle) that required towing due to damage. One of the variables collected on the vehicle level is the rollover initiation type (ROLINTYP). It takes the value 0 for non-rollovers, 3 for turn-overs (also known as untripped rollovers) and 1 for trip-overs, as well as other values corresponding to other rollover modes, such as climb-over and collision with another vehicle. A turn-over is a rollover for which there is no obvious cause other than normal surface friction. A trip-over is a vehicle whose rollover was initiated by contact with a formation such as a ditch or curb. A missing value of ROLINTYP denotes a rollover of unknown origin. There are a number of these and they are assumed to be typical of the other rollovers since inability to inspect the site for more details generally is independent of the circumstances precipitating the rollover.



In 1998, an independent audit of ROLINTYP was conducted by the NASS Zone Centers for a sample of rollovers in NASS that originally took place from 1992 to 1996. A variable AUDTYP was defined that reflects the audit's determination. For rollovers, AUDTYP takes the values "turn-over" and "other rollover" which includes trip-over. All the vehicles with ROLINTYP = 3 (turn-over) were audited and received a value of AUDTYP. Some were changed to "other rollover" and some were confirmed to be turn-overs. A smaller fraction of the vehicles that were initially classified as trip-over were audited. Of these, one was changed to turn-over. No vehicles initially classified other than turn-over or trip-over were audited. That includes the ones that had unknown rollover mode.



The goal of this analysis is to generate national estimates of the annual number and proportion of rollovers that are untripped, along with credible estimates of the standard errors of these estimates. A multistage simulation is performed. First, values of ROLINTYP are imputed for the rollovers for which it is missing. This is done, using SAS, by first selecting the probabilities that an unknown is a trip-over, a turn-over or other. These probabilities are selected from a distribution based on the population of knowns, with variance based on the sample size. Using these probabilities, actual or imputed values of ROLINTYP are created for all rollovers. Next, the probabilities for the values of AUDTYP given ROLINTYP are chosen from a distribution based on the ones that are known, and then each rollover is given an imputed value for AUDTYP based on these probabilities. A SUDAAN crosstab is performed resulting in point estimates of the distribution of AUDTYP and the survey standard errors.



In order to estimate the component of variance due to the distribution of the unknown values of ROLINTYP and to the imputed values of AUDTYP, this process is repeated 5 times. This results in imputed values AUDTYP1-AUDTYP5. For each of these, a SUDAAN crosstab is performed resulting in point estimates and standard errors. The sample variance between the 5 point estimates is then computed. It is estimated that there are about 7,900 untripped rollovers annually with a standard error of about 2,340. These account for around 3.7 percent of all rollovers.



Keywords



Traffic data; survey; imputation; simulation.



Background



The National Highway Traffic Safety Administration maintains the National Automotive Sampling System Crashworthiness Data System. The NASS CDS is a survey of all crashes involving a passenger vehicle (passenger car, light truck or van or sports utility vehicle) that required towing due to damage. One of the variables collected on the vehicle level is the rollover initiation type (ROLINTYP). It takes the value 0 for non-rollovers, 3 for turn-overs (also known as untripped rollovers) and 1 for trip-overs, as well as other values corresponding to other rollover modes, such as climb-over and collision with another vehicle. A turn-over is a rollover for which there is no obvious cause other than normal surface friction. A trip-over is a vehicle whose rollover was initiated by contact with a formation such as a ditch or curb. A missing value of ROLINTYP denotes a rollover of unknown origin. There are a number of these and they are assumed to be typical of the other rollovers since inability to inspect the site for more details generally is independent of the circumstances precipitating the rollover. Table 1 displays the distribution of the variable ROLINTYP in NASS 1992-96. These are unweighted sample sizes, not national estimates.



Table 1

Frequency of ROLINTYP Codes in NASS 1992-96

Description                     Code    Sample Size
Non-rollover                    0       34,718
Trip-over                       1        1,701
Flip over                       2          281
Turn-over                       3          267
Climb over                      4           79
Fall over                       5          274
Bounce over                     6          360
Collision with other vehicle    7          342
End over end                    98          28
Other                           8           47
Missing or unknown              9,U        199
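As a quick arithmetic check on Table 1, the sketch below tallies the unweighted counts into the collapsed categories that the Methodology section works with. The dictionary layout and the helper name collapse are illustrative assumptions, not part of the original SAS code.

```python
# Unweighted sample counts from Table 1, keyed by ROLINTYP code.
TABLE_1 = {
    "0": ("Non-rollover", 34718),
    "1": ("Trip-over", 1701),
    "2": ("Flip over", 281),
    "3": ("Turn-over", 267),
    "4": ("Climb over", 79),
    "5": ("Fall over", 274),
    "6": ("Bounce over", 360),
    "7": ("Collision with other vehicle", 342),
    "98": ("End over end", 28),
    "8": ("Other", 47),
    "9,U": ("Missing or unknown", 199),
}

# Codes with a dedicated category; every other code is an "other rollover".
SPECIAL = {"0": "non-rollover", "1": "trip-over", "3": "turn-over", "9,U": "unknown"}


def collapse(table):
    """Hypothetical helper: group ROLINTYP codes into the five analysis categories."""
    groups = {"non-rollover": 0, "turn-over": 0, "trip-over": 0,
              "other rollover": 0, "unknown": 0}
    for code, (_description, count) in table.items():
        groups[SPECIAL.get(code, "other rollover")] += count
    return groups


collapsed = collapse(TABLE_1)
known_rollovers = (collapsed["turn-over"] + collapsed["trip-over"]
                   + collapsed["other rollover"])
print(collapsed)
print("Known rollovers:", known_rollovers)
```

The tally gives 1,411 rollovers of other known types and 3,379 rollovers of known type in total, which are the counts the imputation step relies on.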



In 1998, an independent audit of ROLINTYP was conducted by the NASS Zone Centers for a sample of these NASS rollovers. All of the 267 that were originally coded as turn-over (untripped) were audited. In addition, 43 of the 1,701 vehicles originally coded as trip-overs were audited. The Zone Centers recommended changing 190 of the turn-overs to other rollover initiation types and 1 of the 43 audited trip-overs to turn-over. No vehicles initially classified as other than turn-over or trip-over were audited; that includes the ones that had unknown rollover mode. The goal of this analysis is to generate an estimate of the actual annual number and proportion of rollovers that are untripped, based on the assumption that the audited determinations are all correct.



Methodology

A variable AUDTYP was defined that takes the values "turn-over", "other rollover" and "non-rollover". The audited rollovers and the non-rollovers (ROLINTYP = 0) had known values of AUDTYP. Since all of the 267 vehicles with a ROLINTYP of 3 were audited, they all received a value of AUDTYP. The 190 of these that were changed were assigned "other rollover" and the remaining 77 "turn-over". Similarly, the 43 trip-overs (ROLINTYP of 1) that were audited received values of AUDTYP, of which one was "turn-over" and the rest "other rollover". It was assumed that untripped rollovers were very unlikely to have been assigned a value of ROLINTYP other than 1, 3 or unknown. That is, an untripped rollover could possibly be mistaken for a trip-over but not, say, a collision or a climb over. Also, a vehicle classified as a non-rollover was assumed always to be a true non-rollover. The variable ROLINTYP was therefore collapsed into the categories "turn-over", "trip-over", "other rollover", "non-rollover" and "unknown", and vehicles with ROLINTYP other than 0, 1, 3, 9 or U were assigned an AUDTYP of "other rollover". The 199 passenger vehicle rollovers with unknown ROLINTYP (9 or U) are assumed to be typical of the other rollovers, that is, they are cases for which information was no longer available at the time of inspection. The following steps were followed:



(1) Each of the unknowns is given an imputed value of ROLINTYP: Referring to Table 1, the frequencies of known (rollover) values of ROLINTYP are 267 turn-overs, 1,701 trip-overs and 1,411 other known types, for a total of 3,379. A rollover vehicle with an unknown ROLINTYP is called a "turn-over" with probability p1, a "trip-over" with probability p2 and an "other rollover" with probability 1 - p1 - p2. Here, p1 has been selected from a normal distribution with mean p1_hat = 267/3379 and variance p1_hat(1 - p1_hat)/3379. Similarly, p2 has been selected from a normal distribution with mean p2_hat = 1701/3379 and variance p2_hat(1 - p2_hat)/3379.



(2) Each of the rollovers that does not have a value of AUDTYP is then given an imputed one: Given ROLINTYP = "turn-over", the rollover has AUDTYP = "turn-over" with probability p3, where p3 has been selected from a normal distribution with mean p3_hat = 77/267 and variance p3_hat(1 - p3_hat)/267. Given ROLINTYP = "trip-over", the rollover has AUDTYP = "turn-over" with probability p4, where p4 has been selected from a normal distribution with mean p4_hat = 1/43 and variance p4_hat(1 - p4_hat)/43. Other rollover types are assigned an AUDTYP of "other rollover" and non-rollovers are assigned "non-rollover" with probability 1. Therefore, each vehicle has an actual or imputed value of AUDTYP that is either "turn-over", "other rollover" or "non-rollover".



(3) A SUDAAN PROC CROSSTAB of AUDTYP is conducted on the subpopulation of rollovers, using the NASS CDS survey weights divided by 5 so that the five years of data (1992-96) yield annual averages. This produces national estimates of the annual number and proportion of untripped rollovers as well as estimates of their standard errors.



(4) Steps (1)-(3) are repeated 5 times. The results are displayed in Table 2.
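Steps (1) and (2) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the SAS program used for the analysis: the function names are hypothetical, normal draws that fall outside [0, 1] are clipped (the text does not say how such draws were handled), and the result is an unweighted sample count because the survey weights and the SUDAAN step (3) are not reproduced here.

```python
import random

# Counts taken from Table 1 and the audit results described in the text.
N_TURN, N_TRIP, N_OTHER = 267, 1701, 1411
N_KNOWN = N_TURN + N_TRIP + N_OTHER      # 3,379 rollovers of known type
N_UNKNOWN = 199                          # rollovers with missing ROLINTYP
AUDITED_TRIP = 43                        # audited trip-overs
CONFIRMED_TURN = 77                      # audited turn-overs kept as turn-over
TRIP_TO_TURN = 1                         # audited trip-overs changed to turn-over


def draw_probability(successes, n, rng):
    """Draw p from Normal(p_hat, p_hat(1 - p_hat)/n), clipped to [0, 1].

    The clipping is an assumption of this sketch.
    """
    p_hat = successes / n
    sd = (p_hat * (1.0 - p_hat) / n) ** 0.5
    return min(1.0, max(0.0, rng.gauss(p_hat, sd)))


def simulate_once(rng):
    """One pass of steps (1)-(2); returns the unweighted count of AUDTYP = turn-over."""
    p1 = draw_probability(N_TURN, N_KNOWN, rng)               # unknown -> turn-over
    p2 = draw_probability(N_TRIP, N_KNOWN, rng)               # unknown -> trip-over
    p3 = draw_probability(CONFIRMED_TURN, N_TURN, rng)        # turn-over stays turn-over
    p4 = draw_probability(TRIP_TO_TURN, AUDITED_TRIP, rng)    # trip-over is really turn-over

    # Audited vehicles keep their audited value.
    turn_overs = CONFIRMED_TURN + TRIP_TO_TURN

    # Unaudited trip-overs receive an imputed AUDTYP.
    for _ in range(N_TRIP - AUDITED_TRIP):
        turn_overs += int(rng.random() < p4)

    # Unknowns first receive an imputed ROLINTYP, then an imputed AUDTYP.
    for _ in range(N_UNKNOWN):
        u = rng.random()
        if u < p1:
            turn_overs += int(rng.random() < p3)
        elif u < p1 + p2:
            turn_overs += int(rng.random() < p4)
        # otherwise "other rollover", which is never a turn-over

    return turn_overs


rng = random.Random(1998)  # arbitrary seed, for reproducibility only
counts = [simulate_once(rng) for _ in range(5)]
print(counts)
```

Because p4_hat = 1/43 has a standard deviation of about the same size as its mean, a noticeable share of normal draws for p4 are negative, which is why some rule such as the clipping assumed here is needed.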





Table 2

Estimated Annual Untripped Rollovers 1992-96

Simulation    Est. Number of         Standard Error    Percentage of    Standard Error
              Untripped Rollovers    of Est. Number    All Rollovers    of Percentage
1             9,685                  2,867             4.51             1.10
2             7,347                  1,743             3.42             0.92
3             7,368                  1,505             3.43             0.79
4             7,485                  1,895             3.49             0.94
5             7,447                  2,254             3.47             1.16
Average       7,866                  2,053             3.67             0.98

Sample Std.: 1,018 (number), 0.47 (percentage)
Total Std.: 2,340 (number), 1.10 (percentage)







Discussion



An estimate of a population parameter, in this case the number of untripped rollovers among towed passenger vehicles from 1992-96, can be regarded as a random variable whose probability distribution derives from the survey sampling plan (Särndal et al., 1992). Let Y be that estimate and S the subset of the population of crashes involving towed passenger vehicles from 1992-96 that was selected. Y is obtained by a two-step process: first, a finite sample S is selected from the population according to the NASS sampling plan. Next, Y is obtained as a random function of S.



If (S,Y) is any pair of random variables with a joint distribution and for which Y has a finite variance, it is well known that var(Y) = E[var(Y|S)] + var[E(Y|S)] (Lindgren, 1968). Since SUDAAN does not "know" about the distribution and imputation process, the estimated variance of Y that SUDAAN returns is an estimate of var[E(Y|S)], that is, a "between samples" variance. The "within the sample" variance is var(Y|S), which is an estimate of E[var(Y|S)]. Referring to Table 2, the "between samples" variance is estimated by the average of the SUDAAN variances and the "within the sample" variance is estimated by the sample variance of the point estimates. The final estimates are that there is an annual average of 7,866 untripped rollovers with a total standard error of 2,340. This constitutes 3.67 percent of the passenger vehicle rollovers, with a standard error of 1.10.
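The combination of the two variance components can be checked numerically from the figures in Table 2. The sketch below is an illustration only; the function name total_std is an assumption, and the inputs are the rounded values printed in the table rather than the unrounded SUDAAN output.

```python
from statistics import mean, variance

# Rounded values from Table 2 (five simulations).
number_estimates = [9685, 7347, 7368, 7485, 7447]
number_std_errors = [2867, 1743, 1505, 1895, 2254]
percent_estimates = [4.51, 3.42, 3.43, 3.49, 3.47]
percent_std_errors = [1.10, 0.92, 0.79, 0.94, 1.16]


def total_std(estimates, std_errors):
    """Square root of (average SUDAAN variance + sample variance of the estimates)."""
    between_samples = mean([se ** 2 for se in std_errors])   # estimates var[E(Y|S)]
    within_sample = variance(estimates)                      # estimates E[var(Y|S)]
    return (between_samples + within_sample) ** 0.5


print(round(mean(number_estimates)), round(total_std(number_estimates, number_std_errors)))
print(round(mean(percent_estimates), 2),
      round(total_std(percent_estimates, percent_std_errors), 2))
```

Run on the rounded table values, this reproduces the total standard errors of about 2,340 and 1.10 reported in Table 2; the averages agree with the table up to rounding of the inputs.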



The original variable ROLINTYP is assumed to be missing at random (MAR) (Little and Rubin, 1987). However, the variable of interest, AUDTYP, is highly correlated with ROLINTYP and is not MAR, since the decision to audit was not random. Therefore, the vehicles with unknown ROLINTYP are assumed to be typical of the population of rollovers but not of the audited population.



This paper describes a "brute force" approach. More sophisticated analytic approaches, such as Bayesian models, would be of interest.



References



(1) Statistical Analysis with Missing Data, Roderick J. A. Little and Donald B. Rubin, 1987

(2) The National Automotive Sampling System, 1992-96, NHTSA

(3) The SAS System

(4) The SUDAAN System

(5) Model Assisted Survey Sampling, Carl-Erik Särndal, Bengt Swensson, Jan Wretman, 1992

(6) Statistical Theory, Second Edition, B. W. Lindgren, 1968