
7. VERIFICATION PROCEDURES

INTRODUCTION

As part of the acceptance procedures and requirements, one question that must be answered is "Who is going to perform the acceptance tests?" The agency may decide to do the acceptance testing itself, assign the testing to the contractor, use a combination of agency and contractor acceptance testing, or require a third party to do the testing.

The decision as to who does the testing usually emanates from the agency's personnel assessment, particularly in this era of agency downsizing. Many agencies are requiring the contractor to do the acceptance testing, at least partially because of agency staff reductions. What has often evolved is that the contractor is required to perform both QC and acceptance testing. If the contractor is assigned the acceptance function, the contractor's acceptance tests must be verified by the agency. The agency's verification sampling and testing serve the same underlying purpose as agency acceptance sampling and testing: to verify the quality of the product. This requires that a separate verification program, built on statistically sound procedures, be developed. There are several forms of verification procedures, and some are more efficient than others. To avoid conflict, it is in the best interests of both parties to make the verification process as effective and efficient as possible.

The sources of variability are important when deciding what type of verification procedures to use. This decision depends on what the agency wants to verify. Independent samples (i.e., those obtained without respect to each other) contain up to four sources of variability: material, process, sampling, and testing. Split samples contain variability only in the testing method. Thus, if the agency wishes to verify only that the contractor's testing methods are correct, then the use of split samples is best. This is referred to as test method verification. If the agency wishes to verify the contractor's overall production, sampling, and testing processes, then the use of independent samples is required. This is referred to as process verification. Each of these types of verification is evaluated in the following sections.

HYPOTHESIS TESTING AND LEVELS OF SIGNIFICANCE

Before discussing the various procedures that can be used for test method verification or process verification, two concepts must be understood: hypothesis testing and level of significance. When it is necessary to test whether or not it is reasonable to accept an assumption about a set of data, statistical tests (called hypothesis tests) are conducted. Strictly speaking, a statistical test neither proves nor disproves a hypothesis. What it does is prescribe a formal manner in which evidence is to be examined to make a decision regarding whether or not the hypothesis is correct.

To perform a hypothesis test, it is first necessary to define an assumed set of conditions known as the null hypothesis (H0). Additionally, an alternative hypothesis (Ha) is, as the name implies, an alternative set of conditions that will be assumed to exist if the null hypothesis is rejected. The statistical procedure consists of assuming that the null hypothesis is true and then examining the data to see if there is sufficient evidence that it should be rejected. The H0 cannot actually be proved, only disproved. If the null hypothesis cannot be disproved (or, to be statistically correct, rejected), it should be stated that we fail to reject, rather than prove or accept, the hypothesis. In practice, some people use accept rather than fail to reject, although this is not exactly statistically correct.

Verification testing is simply hypothesis testing. For test method or process verification purposes, the null hypothesis would be that the contractor's tests and the agency's tests have equal means, while the alternate hypothesis would be that the means are not equal.

Hypothesis tests are conducted at a selected level of significance, α, where α is the probability of incorrectly rejecting the H0 when it is actually true. The value of α is typically selected as 0.10, 0.05, or 0.01. For example, if α = 0.01 and the null hypothesis is rejected, then there is only 1 chance in 100 that H0 is true and was rejected in error.

The performance of hypothesis tests, or verification tests, can be evaluated by using OC curves. OC curves plot either the probability of not detecting a difference (i.e., accepting the null hypothesis that the populations are equal) or the probability of detecting a difference (i.e., rejecting the null hypothesis that the populations are equal) versus the actual difference between the two populations being compared. Curves that plot the probability of detecting a difference are sometimes called power curves because they plot the power of the statistical test procedure to detect a given difference.

Just as there is a risk of incorrectly rejecting the H0 when it is actually true, which is called the type I (or α) error, there is also a risk of failing to reject the H0 when it is actually false. This is called the type II (or β) error. The power is the probability of rejecting the H0 when it is actually false and it is equal to 1 - β. Both α and β are important and are used with the OC curves when determining the appropriate sample size to be used.

TEST METHOD VERIFICATION

The procedures for verifying the testing procedures should be based on split samples so that the testing method is the only source of variability present. The two procedures used most often for test method verification are: (1) comparing the difference between the split-sample results to a maximum allowable difference, and (2) the use of the t-test for paired measurements (i.e., the paired t-test). In this report, these are referred to as the maximum allowable difference and the paired t-test, respectively, and each is discussed below.

Maximum Allowable Difference

This is the simplest procedure that can be used for verification, although it is the least powerful. In this method, usually a single sample is split into two portions, with one portion tested by the contractor and the other portion tested by the agency. The difference between the two test results is then compared to a maximum allowable difference. Because the procedure uses only two test results, it cannot detect real differences unless the results are far apart.

The maximum allowable difference is usually selected in the same manner as the D2S limits contained in many American Association of State Highway and Transportation Officials (AASHTO) and American Society for Testing and Materials (ASTM) test procedures. The D2S limit indicates the maximum acceptable difference between two results obtained on test portions of the same material (and thus applies only to split samples) and is provided for single- and multi-laboratory situations. It represents the difference between two individual test results that has approximately a 5-percent chance of being exceeded if the tests are actually from the same population.

Stated in general statistical terminology, the maximum allowable difference is set at two times the standard deviation of the distribution of the differences that would be obtained if the two test populations (the contractor's and the agency's) were actually equal. In other words, if the two populations are truly the same, there is approximately a 0.05 chance that this verification method will find them to be not equal. Therefore, the level of significance is 0.05 (5 percent).
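
As a minimal illustration, the sketch below (in Python; the function name, the example results, and the value of σtest are hypothetical) flags a split-sample pair whose difference exceeds two standard deviations of the difference, i.e., 2√2 times the test-method standard deviation.

import math

def split_sample_check(contractor_result, agency_result, sigma_test):
    """Compare one split-sample difference to the maximum allowable difference.

    The limit is 2 * sqrt(2) * sigma_test (about 2.83 sigma_test): two standard
    deviations of the difference of two results that share the test-method
    standard deviation sigma_test.
    """
    limit = 2.0 * math.sqrt(2.0) * sigma_test
    difference = abs(contractor_result - agency_result)
    return difference <= limit, difference, limit

# Hypothetical asphalt-content results (percent) on one split sample.
ok, diff, limit = split_sample_check(5.6, 5.2, sigma_test=0.2)
print(f"difference = {diff:.2f}, limit = {limit:.2f}, results comparable: {ok}")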

OC Curves: OC curves were developed to evaluate the performance of the maximum allowable difference method for test method verification. In this method, a test is performed on a single split sample to compare the agency's and the contractor's test results. If we assume that both of these split test results are from normally distributed subpopulations, then we can calculate the variance of the difference and use it to calculate two standard deviation limits (approximately 95 percent) for the sample difference quantity.

Suppose that the agency's subpopulation has a variance σA² and the contractor's subpopulation has a variance σC². Since the variance of the difference of two independent random variables is the sum of the variances, the variance of the difference between an agency observation and a contractor observation is σA² + σC². The maximum allowable difference is based on the test standard deviation, which may be provided in the form of D2S limits. Let us call this test standard deviation σtest. Under the assumption that σA² = σC² = σtest², this variance of a difference becomes 2σtest².

The maximum allowable difference limits are set as two times the standard deviation of the test differences (i.e., approximately 95-percent limits). This, therefore, sets the limits at ±2√(2σtest²), which is ±2√2 σtest (approximately ±2.8284σtest). Without loss of generality, we can take σtest = 1 (i.e., work in units of σtest) and assume a mean difference of 0, using the region between -2.8284 and +2.8284 as the acceptance region for the difference between an agency test result and a contractor test result. With these two limits fixed, we can calculate the power of this decisionmaking process relative to various true differences in the underlying subpopulation means and/or various ratios of the true underlying subpopulation standard deviations.
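
The probability plotted on the OC surfaces can be computed directly from the normal distribution. The following sketch, written under the same assumptions (normally distributed subpopulations, limits fixed at ±2√2 σtest), returns the chance that a single split-sample difference falls outside the limits for an assumed mean offset and assumed agency and contractor standard deviations; the example values are illustrative only.

import math
from scipy.stats import norm

def detection_probability(mean_diff, sigma_agency, sigma_contractor, sigma_test=1.0):
    """Probability that one split-sample difference exceeds +/- 2*sqrt(2)*sigma_test.

    The difference of the two results is taken as normal with mean mean_diff and
    variance sigma_agency**2 + sigma_contractor**2.
    """
    limit = 2.0 * math.sqrt(2.0) * sigma_test
    sd_diff = math.sqrt(sigma_agency**2 + sigma_contractor**2)
    inside = norm.cdf(limit, mean_diff, sd_diff) - norm.cdf(-limit, mean_diff, sd_diff)
    return 1.0 - inside

# Equal populations: about a 0.045 chance of (falsely) flagging a difference.
print(detection_probability(0.0, 1.0, 1.0))
# A 3-sigma mean offset with a 5:1 standard deviation ratio: about 0.64, still
# under a 70-percent chance of detection, as figure 45 indicates.
print(detection_probability(3.0, 1.0, 5.0))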

These power values can conveniently be displayed as a three-dimensional surface. If we vary the mean difference along the first axis and the standard deviation ratio along a second axis, we can show power on the vertical axis. The agency's subpopulation, the contractor's subpopulation, or both, could have standard deviations that are smaller, about the same, or larger than the supplied σtest value. To develop OC curves, these situations were represented in terms of the minimum standard deviation between the contractor's population and the agency's population as follows:

  • Minimum standard deviation equals the test standard deviation (σtest).
  • Minimum standard deviation equals half the test standard deviation.
  • Minimum standard deviation equals twice the test standard deviation.

Figures 45 through 47 show the OC curves for each of the above cases. The power values are shown where the ratio of the larger of the agency's or the contractor's standard deviation to the smaller of the agency's or contractor's standard deviation is varied over the values 0, 1, 2, 3, 4, and 5. The mean difference given along the horizontal axis (values of 0, 1, 2, and 3) represents the difference in the agency's and contractor's subpopulation means expressed as multiples of σtest.

In figure 45, which shows the case when the minimum standard deviation equals the test standard deviation (σtest), even when the ratio of the contractor's and agency's standard deviations is 5 and the difference between the contractor's and the agency's means is three times the value for σtest, there is less than a 70-percent chance of detecting the difference based on the results from a single split sample. As would be expected, the power values decrease when the minimum standard deviation is half of σtest (figure 46) and increase when the minimum standard deviation is twice σtest (figure 47).

As is the case with any method based on a sample size = 1, the D2S method does not have much power to detect the differences between the contractor's and the agency's populations. The appeal of the maximum allowable difference method lies in its simplicity, rather than in its power.

Average Run Length: The maximum allowable difference method was also evaluated based on the average run length. The average run length is the average number of lots that it takes to identify a difference between dissimilar populations. As such, the shorter the average run length, the better.

Various actual differences between the contractor's and the agency's population means and standard deviations were considered in the analysis. In the results that are presented, i refers to the difference (in units of the agency's population standard deviation) between the agency's and the contractor's population means. Also, j refers to the ratio of the contractor's population standard deviation to the agency's population standard deviation. In the analyses, i values of 0, 1, 2, and 3 were used, while the j values used were 0.5, 1.0, 1.5, and 2.0. Some examples of these i and j values are illustrated in figure 48.


Figure 45. OC surface for the maximum allowable difference test method verification method (assuming the smaller σ = σtest).


Figure 46. OC surface for the maximum allowable difference test method verification method (assuming the smaller σ = 0.5 σtest).


Figure 47. OC surface for the maximum allowable difference test method verification method (assuming the smaller σ = 2 σtest).


Figure 48a. Example 1 of some of the cases considered in the average run length analysis for the maximum allowable difference method.


Figure 48b. Example 2 of some of the cases considered in the average run length analysis for the maximum allowable difference method.


Figure 48c. Example 3 of some of the cases considered in the average run length analysis for the maximum allowable difference method.

The results of the analyses are presented in table 31 and figure 49. These values are based on 5000 simulated projects. As shown in the table, when i = 0 and j = 1.0 (meaning that the contractor's and the agency's populations are the same), the average run length is approximately 21.5 project lots. This is consistent with what would be expected. Since the limits are set at 2 standard deviations and since there is only 0.0455 chance of a value outside of 2 standard deviations, there is only 1 chance in 22 of declaring the populations to be different for this situation. It should also be noted in the table that the standard deviation values are nearly as large as the average run lengths. This means that for any individual simulated project, the run length could have varied greatly from the average. Indeed, for this case, the individual run lengths varied from 1 to more than 200.

Table 31 clearly shows that as the difference between the population means (i) increases, the average run length decreases since it is easier to detect a difference between the two populations. This is also true for the ratio of the population standard deviations (j).
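
The run-length behavior can be approximated with a rough Monte Carlo sketch such as the one below, which simulates lots until the first split-sample difference exceeds the limit and averages the run lengths. The seed, the project count, and the assumption that σtest equals the agency's standard deviation are choices made only for illustration.

import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed

def average_run_length(mean_diff, sigma_ratio, n_projects=5000, max_lots=10_000):
    """Average number of lots until one split-sample difference exceeds the limit.

    mean_diff is the contractor-minus-agency mean offset in units of the agency's
    standard deviation; sigma_ratio is contractor sigma divided by agency sigma.
    The test-method standard deviation is taken equal to the agency's (1.0).
    """
    limit = 2.0 * np.sqrt(2.0)
    run_lengths = np.empty(n_projects)
    for p in range(n_projects):
        for lot in range(1, max_lots + 1):
            difference = rng.normal(mean_diff, sigma_ratio) - rng.normal(0.0, 1.0)
            if abs(difference) > limit:
                break
        run_lengths[p] = lot
    return run_lengths.mean(), run_lengths.std()

# Identical populations: the average should land near 1/0.0455, i.e., about 22 lots.
print(average_run_length(0.0, 1.0))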

Table 31. Average run length results for the single split-sample method (5000 simulated lots).
Mean Difference,      Contractor's σ ÷      Run Length
units of agency's σ   Agency's σ            Average     Std. Dev.
0                     0.5                    85.57       85.44
                      1.0                    21.55       20.88
                      1.5                     8.43        8.04
                      2.0                     4.83        4.19
1                     0.5                    19.16       19.11
                      1.0                     9.86        9.14
                      1.5                     5.83        5.25
                      2.0                     4.07        3.53
2                     0.5                     4.38        3.82
                      1.0                     3.58        3.03
                      1.5                     3.10        2.56
                      2.0                     2.67        2.09
3                     0.5                     1.77        1.14
                      1.0                     1.85        1.27
                      1.5                     1.88        1.29
                      2.0                     1.88        1.30

Paired t-Test

Since the maximum allowable difference is not a very powerful test, another procedure that uses multiple test results to conduct a more powerful hypothesis test can be used. For the case in which it is desirable to compare more than one pair of split-sample test results, the t-test for paired measurements (i.e., the paired t-test) can be used. This test uses the differences between pairs of tests and determines whether the average difference is statistically different from zero. Thus, it is the difference within the pairs, not between the pairs, that is being tested. The t-statistic for the paired t-test is:

(7)

t = | Xd | / ( Sd / √n )

where: Xd = average of the differences between the split-sample test results

Sd = standard deviation of the differences between the split-sample test results

n = number of split samples

The calculated t-value is then compared to the critical value (tcrit) obtained from a table of t-values at a level of α/2 and n - 1 degrees of freedom. Computer programs, such as Microsoft® Excel, contain statistical test procedures for the paired t-test. This makes the implementation process straightforward.
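
For illustration, the following sketch evaluates equation 7 on made-up split-sample results and cross-checks it against the built-in paired t-test in scipy; the data values are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical contractor and agency results on the same six split samples.
contractor = np.array([5.6, 5.9, 6.1, 5.4, 5.8, 6.0])
agency = np.array([5.4, 5.8, 6.3, 5.2, 5.7, 5.8])

d = contractor - agency
n = len(d)
t_stat = abs(d.mean()) / (d.std(ddof=1) / np.sqrt(n))   # equation 7
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 1)               # alpha = 0.05, two-sided
print(f"t = {t_stat:.3f}, critical t = {t_crit:.3f}, differ: {t_stat > t_crit}")

# Built-in equivalent: the paired t-test (two-sided p-value).
print(stats.ttest_rel(contractor, agency))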

OC Curves: OC curves can be consulted to evaluate the performance of the paired t-test in identifying the differences between population means. OC curves are useful in answering the question, "How many pairs of test results should be used?" This form of the OC curve, for a given level of α, plots on the vertical axis the probability of either not detecting (β) or detecting (1 - β) a difference between two populations. The standardized difference between the two population means is plotted on the horizontal axis.

For a paired t-test, the standardized difference (d) is measured as:

(8)

d = | μc - μa | / σd

where: | μc - μa | = true absolute difference between the mean μc of the contractor's test result population (which is unknown) and the mean μa of the agency's test result population (which is unknown)

σd = standard deviation of the true population of signed differences between the paired tests (which is unknown)

The OC curves are developed for a given level of significance (α). OC curves for α values of 0.05 and 0.01 are shown in figures 49 and 50, respectively. It is evident from the OC curves that, for any given probability of not detecting a difference (β, the value on the vertical axis), the required n increases as the difference d (the value on the horizontal axis) decreases. In some cases, the desired β or difference may require prohibitively large sample sizes. In that case, a compromise must be made between the discriminating power desired, the cost of the amount of testing required, and the risk of claiming a difference when none exists.

To use this OC curve, the true standard deviation of the signed differences (σd) is assumed to be known (or approximated based on past data or published literature). After experience is gained with the process, σd can be more accurately defined and a better idea of the required number of tests can be determined.

As an example of how to use the OC curves, assume that the number of pairs of split-sample tests needed to verify some test method is desired. The probability of not detecting a difference (β) is chosen as 20 percent, or 0.20. (Some OC curves, often called power curves, use 1 - β (known as the power of the test) on the vertical axis; the only difference is the change of scale, with 1 - β in this case being 80 percent, or 0.80.) Assume that the absolute difference between μc and μa should not be greater than 20 units, that the standard deviation of the differences is 20 units, and that α is selected as 0.05. This produces a d value of 20 ÷ 20 = 1.0. Reading this value on the horizontal axis and a β of 0.20 on the vertical axis shows that about 10 paired split-sample tests are necessary for the comparison.
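
The curve reading can also be reproduced numerically. The sketch below computes the power of the paired (one-sample) t-test from the noncentral t distribution for a range of sample sizes; for d = 1.0 and α = 0.05 the power climbs through roughly 0.80 near n = 10, in line with the value read from figure 49. This is a numerical stand-in for the published curves, not a reproduction of them.

import numpy as np
from scipy import stats

def paired_t_power(n, d, alpha=0.05):
    """Two-sided power of the paired t-test with n pairs and standardized difference d.

    d = |mu_c - mu_a| / sigma_d, the quantity on the horizontal axis of the OC curves.
    """
    df = n - 1
    ncp = d * np.sqrt(n)                        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# d = 1.0 (a 20-unit difference with sigma_d of 20 units), alpha = 0.05.
for n in range(5, 16):
    print(n, round(paired_t_power(n, d=1.0), 3))
# The power climbs through roughly 0.80 (beta roughly 0.20) near n = 10.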


Figure 49. OC curves for a two-sided t-test ( α = 0.05) (Natrella, M.G., "Experimental Statistics," National Bureau of Standards Handbook 91, 1963).


Figure 50. OC curves for a two-sided t-test ( α = 0.01) (Natrella, M.G., "Experimental Statistics," National Bureau of Standards Handbook 91, 1963).

PROCESS VERIFICATION

Procedures to verify the overall process should be based on independent samples so that all of the components of variability (i.e., process, materials, sampling, and testing) are present. Two procedures for comparing independently obtained samples appear in the AASHTO Implementation Manual for Quality Assurance.(2) These two methods appear in the AASHTO manual in appendix G, which is based on the comparison of a single agency test with 5 to 10 contractor tests, and in appendix H, which is based on the use of the F-test and t-test to compare a number of agency tests with a number of contractor tests. These methods are referred to as the AASHTO appendix G method and the AASHTO appendix H method, respectively. Each of these methods is discussed and analyzed in the following sections.

AASHTO Appendix G Method

In this method, a single agency test result must fall within an interval that is defined from the average and range of 5 to 10 contractor test results. The allowable interval within which the agency's test must fall is X ± CR, where X and R are the mean and range, respectively, of the contractor's tests, and C is a factor that varies with the number of contractor tests. The factor C is the product of a factor to estimate the sample standard deviation from the sample range and the t-value for the 99th percentile of the t-distribution. This is not a particularly efficient approach, although this statement can be made for any method that is based on the use of a single agency test. Table 32 indicates the allowable interval based on the number of contractor tests.

Table 32. Allowable intervals for the AASHTO appendix G method.
Number of Contractor Tests   Allowable Interval
10                           X ± 0.91 R
9                            X ± 0.97 R
8                            X ± 1.05 R
7                            X ± 1.17 R
6                            X ± 1.33 R
5                            X ± 1.61 R
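
A minimal sketch of the appendix G comparison is shown below; the C factors are taken from table 32, and the test values in the example are hypothetical.

# Allowable-interval factors C from table 32, keyed by the number of contractor tests.
C_FACTORS = {5: 1.61, 6: 1.33, 7: 1.17, 8: 1.05, 9: 0.97, 10: 0.91}

def appendix_g_check(contractor_tests, agency_test):
    """Return whether the single agency test falls inside Xbar +/- C*R."""
    n = len(contractor_tests)
    c = C_FACTORS[n]                      # only defined for 5 to 10 contractor tests
    mean = sum(contractor_tests) / n
    r = max(contractor_tests) - min(contractor_tests)
    lower, upper = mean - c * r, mean + c * r
    return lower <= agency_test <= upper, (lower, upper)

# Hypothetical example: five contractor tests and one agency verification test.
ok, (lower, upper) = appendix_g_check([92.1, 93.4, 91.8, 92.9, 93.0], agency_test=94.6)
print(f"allowable interval = ({lower:.2f}, {upper:.2f}), agency test inside: {ok}")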

OC Curves: Computer simulation was used to develop OC curves (plotted as power curves) that indicate the probability of detecting a difference between test populations with various differences in means and various ratios of their standard deviations. The difference between the means of the contractor's and the agency's populations (Δ = ( μContr - μAgency ) / σAgency), stated in units of the agency's standard deviation, was varied from 0 to 3.0. The ratio of the contractor's standard deviation to the agency's standard deviation (σContr / σAgency) was varied from 0.50 to 3.00.

Since two parameters were varied, OC surfaces were plotted, with each surface representing a different number of contractor tests (5 to 10) compared to a single agency test. These OC surfaces are shown in figure 51. As shown in the plots, the power of this procedure is quite low, even when a large number of contractor tests is used and when there are large differences between the means and standard deviations of the contractor's and the agency's populations. For example, with five contractor tests, even when the contractor's standard deviation is three times that of the agency and the contractor's mean is three agency standard deviations away from the agency's mean, there is less than a 50-percent chance of detecting a difference. Even if the number of contractor tests is increased to 10, the probability of detecting a difference is still less than 60 percent.

Average Run Length: The method in appendix G was also evaluated based on the average run length. Various actual differences between the contractor's and the agency's population means and standard deviations were considered in the analysis. In the results that are presented, i refers to the difference (stated in units of the agency's population standard deviation) between the agency's and the contractor's population means. Also, j refers to the ratio of the contractor's population standard deviation to the agency's population standard deviation. In the analyses, i values of 0, 1, 2, and 3 were used, while j values of 0.5, 1.0, 1.5, and 2.0 were used.

The results of the simulation analyses are presented in table 33 for the cases of 5 and 10 contractor tests, each compared with one agency test per lot. These two cases bound the results, since they are the fewest and the most contractor tests allowed by the procedure. As shown in table 33, the run lengths can be quite large, particularly when the contractor's population standard deviation is larger than that of the agency. The values in the table are based on 5000 simulated projects.

Also note that the use of 10 tests performs better than the use of 5 tests when the contractor's standard deviation is equal to or less than that of the agency (ratios of 1.0 and 0.5). However, the opposite is true when the contractor's standard deviation is greater than that of the agency (ratios of 1.5 and 2.0). This runs counter to the expectation that a larger sample should be better at identifying differences between the contractor's and the agency's populations.
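
Both the OC surfaces and the run lengths can be approximated by simulating the rule directly. The sketch below estimates the per-lot probability that the appendix G check (five contractor tests, one agency test) flags a difference; with identical populations it should come out roughly 0.023, i.e., about one flag per 43 lots, consistent with the average run length reported in table 33. The seed and replication count are arbitrary.

import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
C_FACTORS = {5: 1.61, 6: 1.33, 7: 1.17, 8: 1.05, 9: 0.97, 10: 0.91}  # from table 32

def appendix_g_detection_rate(n_contractor, mean_diff, sigma_ratio, reps=20_000):
    """Estimate the per-lot probability that the agency test falls outside Xbar +/- C*R.

    mean_diff is the contractor-minus-agency mean offset in agency standard
    deviation units; sigma_ratio is contractor sigma divided by agency sigma.
    """
    c = C_FACTORS[n_contractor]
    contractor = rng.normal(mean_diff, sigma_ratio, size=(reps, n_contractor))
    agency = rng.normal(0.0, 1.0, size=reps)
    means = contractor.mean(axis=1)
    half_widths = c * (contractor.max(axis=1) - contractor.min(axis=1))
    return np.mean(np.abs(agency - means) > half_widths)

# Identical populations, five contractor tests: roughly 0.023, i.e., about one
# flag per 43 lots, in line with the average run length in table 33.
print(appendix_g_detection_rate(5, mean_diff=0.0, sigma_ratio=1.0))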


Figure 51a. OC Surfaces (also called power surfaces) for the appendix G method for 5 contractor tests compared to a single agency test.


Figure 51b. OC surfaces (also called power surfaces) for the appendix G method for 6 contractor tests compared to a single agency test.


Figure 51c. OC surfaces (also called power surfaces) for the appendix G method for 7 contractor tests compared to a single agency test.


Figure 51d. OC surfaces (also called power surfaces) for the appendix G method for 8 contractor tests compared to a single agency test.


Figure 51e. OC surfaces (also called power surfaces) for the appendix G method for 9 contractor tests compared to a single agency test.


Figure 51f. OC surfaces (also called power surfaces) for the appendix G method for 10 contractor tests compared to a single agency test.

Table 33. Average run length results for the appendix G method (5000 simulated lots).
Mean Difference,      Contractor's σ ÷      Run Length
units of agency's σ   Agency's σ            Average     Std. Dev.

5 Contractor Tests and 1 Agency Test
0                     0.5                     7.92        7.57
                      1.0                    43.30       42.68
                      1.5                   124.19      126.40
                      2.0                   234.45      234.56
1                     0.5                     4.04        3.51
                      1.0                    18.04       17.78
                      1.5                    54.78       53.93
                      2.0                   114.63      114.98
2                     0.5                     1.82        1.24
                      1.0                     6.21        5.69
                      1.5                    17.61       17.23
                      2.0                    39.30       38.33
3                     0.5                     1.22        0.51
                      1.0                     2.88        2.34
                      1.5                     7.23        6.80
                      2.0                    16.23       15.74

10 Contractor Tests and 1 Agency Test
0                     0.5                     5.15        4.70
                      1.0                    40.50       39.90
                      1.5                   230.83      226.93
                      2.0                   887.62      882.77
1                     0.5                     2.74        2.18
                      1.0                    12.76       12.04
                      1.5                    62.33       61.14
                      2.0                   229.00      227.47
2                     0.5                     1.39        0.73
                      1.0                     3.76        3.32
                      1.5                    13.30       12.61
                      2.0                    46.17       46.19
3                     0.5                     1.07        0.28
                      1.0                     1.75        1.20
                      1.5                     4.46        3.94
                      2.0                    12.77       12.15

AASHTO Appendix H Method

This procedure involves two hypothesis tests where the null hypothesis for each test is that the contractor's tests and the agency's tests are from the same population. In other words, the null hypotheses are that the variability of the two data sets is equal for the F-test and that the means of the two data sets are equal for the t-test.

The procedures for the F-test and the t-test are more complicated and involved than that for the appendix G method discussed above. The F-test and the t-test approach also requires more agency test results before a comparison can be made. However, the use of the F-test and the t-test is much more statistically sound and has more power to detect actual differences than the appendix G method, which relies on a single agency test for the comparison. Any comparison method that is based on a single test result will not be very effective in detecting differences between data sets.

When comparing two data sets that are assumed to be normally distributed, it is important to compare both the means and the variances. A different test is used for each of these comparisons. The F-test provides a method for comparing the variances (standard deviations squared) of the two sets of data. The differences in the means are assessed by the t-test. To simplify the use of these tests, they are available as built-in functions in computer spreadsheet programs such as Microsoft® Excel. For this reason, the procedures involved are not discussed in this report. The procedures are fully discussed in the QA manual that was prepared as part of this project.(1)

A question that needs to be answered is: What power do these statistical tests have, when used with small to moderate sample sizes, to declare that various differences in the means and variances are statistically significant? This question is addressed separately for the F-test and the t-test with the development of the OC curves in the following sections.

F-Test for Variances (Equal Sample Sizes): Suppose that we have two sets of measurements that are assumed to come from normally distributed populations and we wish to conduct a test to see if they come from populations that have the same variances (i.e., σx² = σy²). Furthermore, suppose that we select a level of significance of α = 0.05, meaning that we are allowing up to a 5-percent chance of incorrectly deciding that the variances are different when they are really the same. If we assume that these two samples are x1, x2, ..., xnx and y1, y2, ..., yny, we can calculate the sample variances sx² and sy² and construct:

(9)

F = sx² / sy²

and accept σx² = σy² for values of F in the interval [ F1-α/2, nx-1, ny-1, Fα/2, nx-1, ny-1 ].
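
As a small illustration of equation 9, the sketch below computes the F-ratio and the acceptance interval with scipy for made-up contractor and agency results; as noted above, spreadsheet programs provide equivalent built-in functions.

import numpy as np
from scipy import stats

# Hypothetical contractor (x) and agency (y) test results.
x = np.array([92.1, 93.4, 91.8, 92.9, 93.0, 92.5, 93.2, 92.0])
y = np.array([92.6, 93.1, 92.2, 92.8, 93.5])

alpha = 0.05
f_ratio = x.var(ddof=1) / y.var(ddof=1)                     # equation 9
f_lo = stats.f.ppf(alpha / 2, len(x) - 1, len(y) - 1)       # lower critical value
f_hi = stats.f.ppf(1 - alpha / 2, len(x) - 1, len(y) - 1)   # upper critical value
print(f"F = {f_ratio:.3f}, acceptance interval = ({f_lo:.3f}, {f_hi:.3f}), "
      f"fail to reject equal variances: {f_lo <= f_ratio <= f_hi}")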

For this two-sided or two-tailed test, figure 52 shows the probability that we have accepted the two samples as coming from populations with the same variability. This probability is usually referred to as β, and the power of the test is usually referred to as 1 - β. Notice that the horizontal axis is the quantity λ, where λ = σx/σy, the true standard deviation ratio. Thus, for λ = 1, where the hypothesis of equal variance should certainly be accepted, it is accepted with a probability of 0.95, reduced from 1.00 only by the magnitude of our type I error risk (α). One significant limiting factor for the use of figure 52 is the restriction that nx = ny = n. This limitation is addressed in subsequent sections of the report.

Example: Suppose that we have nx = 6 contractor tests and ny = 6 agency tests, conduct an α = 0.05 level test and accept (or fail to reject) that these two sets of tests represent populations with equal variances. What power did our test have to discern whether the populations from which these two sets of tests came were really rather different in variability? Suppose that the true population standard deviation of the contractor's tests (σx) was twice as large as that of the agency's tests (σy), giving λ = 2. If we enter figure 52 with λ = 2 and nx = ny = 6, we find that β ≈ 0.74 or that the power (1 - β) is 0.26. This tells us that with samples of nx = 6 and ny = 6, we only have a 26-percent chance of detecting a standard deviation ratio of 2 (and, correspondingly, a fourfold difference in variance) as being different.

Suppose that we are not comfortable with the power of 0.26, so subsequently we increase the number of tests used. Then suppose that we now have nx = 20 and ny = 20. If we again consider λ = 2, we can determine from figure 52 that the power of detecting these sets of tests as coming from populations with unequal variances to be more than 0.80 (approximately 82 to 83 percent). If we proceed to conduct our F-test with these two samples and conclude that the underlying variances are equal, we will certainly feel much more comfortable with our conclusions.

Figure 53 gives the appropriate OC curves to be used if we choose to conduct an α = 0.01 level test. Again, we see that for equal variances σx² and σy² (i.e., λ = 1), β = 0.99, reduced from 1.00 only by the size of α.

F-Test for Variances (Unequal Sample Sizes): Up to now, the discussions and OC curves have been limited to equal sample sizes. Routines were developed for this project to calculate the power for this test for any combination of sample sizes nx and ny. There are obviously an infinite number of possible combinations for nx and ny. Thus, it is not possible to present OC curves for every possibility. However, three sets of tables were developed to provide a subset of power calculations using some sample sizes that are of potential interest for comparing the contractor's and the agency's samples. These power calculations are presented in table form since there are too many variables to be presented in a single chart, and the data can be presented in a more compact form in tables than in a long series of charts. Table 34 gives power values for all combinations of sample sizes of 3 to 10, with the ratio of the two subpopulation standard deviations = 1, 2, 3, 4, and 5. Table 35 gives power values for the same sample sizes, but with the standard deviation ratios = 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. Table 36 gives power values for all combinations for sample sizes = 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, and 100, with the standard deviation ratio = 1, 2, or 3.
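
The tabulated powers can also be computed directly rather than read from charts. Under the stated assumptions, the observed variance ratio sx²/sy² is distributed as λ² times a central F variable with nx - 1 and ny - 1 degrees of freedom, so the power is simply the probability that this variable falls outside the acceptance interval of equation 9. The sketch below follows that reasoning; for example, it returns approximately 0.099 for λ = 2 and nx = ny = 3, in agreement with table 34.

from scipy import stats

def f_test_power(lam, nx, ny, alpha=0.05):
    """Power of the two-sided F-test when the true ratio sigma_x/sigma_y equals lam.

    The observed ratio sx**2/sy**2 is lam**2 times a central F(nx-1, ny-1)
    variable, so power is the chance that it falls outside the acceptance interval.
    """
    df1, df2 = nx - 1, ny - 1
    f_lo = stats.f.ppf(alpha / 2, df1, df2)
    f_hi = stats.f.ppf(1 - alpha / 2, df1, df2)
    return stats.f.cdf(f_lo / lam**2, df1, df2) + stats.f.sf(f_hi / lam**2, df1, df2)

print(round(f_test_power(2, 3, 3), 5))   # about 0.0994 (compare 0.09939 in table 34)
print(round(f_test_power(2, 10, 3), 5))  # about 0.096: with only 3 tests in the smaller
                                         # sample, adding tests to the other barely helps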


Figure 52. OC curves for the two-sided F-test for level of significance α = 0.05 (Bowker, A.H., and G.J. Lieberman, Engineering Statistics).


Figure 53. OC curves for the two-sided F-test for level of significance α = 0.01 (Bowker, A.H., and G.J. Lieberman, Engineering Statistics).

Table 34. F-test power values for n = 3-10 and s-ratio λ = 1-5.
(Rows are ny; columns are nx; entries are power.)

λ = 1: power = 0.05000 for all combinations of ny = 3-10 and nx = 3-10.

λ = 2
ny \ nx      3        4        5        6        7        8        9        10
 3        0.09939  0.09753  0.09663  0.09620  0.09600  0.09590  0.09586  0.09585
 4        0.14835  0.15169  0.15385  0.15544  0.15668  0.15767  0.15848  0.15915
 5        0.19036  0.20240  0.21041  0.21622  0.22064  0.22413  0.22694  0.22926
 6        0.22309  0.24464  0.25968  0.27093  0.27968  0.28669  0.29243  0.29722
 7        0.24820  0.27854  0.30055  0.31744  0.33086  0.34179  0.35087  0.35853
 8        0.26768  0.30567  0.33401  0.35619  0.37410  0.38888  0.40129  0.41187
 9        0.28308  0.32758  0.36144  0.38837  0.41036  0.42869  0.44421  0.45752
10        0.29549  0.34549  0.38414  0.41521  0.44081  0.46230  0.48060  0.49639

λ = 3
ny \ nx      3        4        5        6        7        8        9        10
 3        0.19034  0.19354  0.19556  0.19696  0.19798  0.19875  0.19934  0.19981
 4        0.31171  0.33525  0.35007  0.36030  0.36777  0.37347  0.37795  0.38157
 5        0.39758  0.44454  0.47603  0.49872  0.51588  0.52931  0.54011  0.54899
 6        0.45403  0.51906  0.56396  0.59696  0.62225  0.64225  0.65846  0.67186
 7        0.49230  0.57007  0.62436  0.66443  0.69516  0.71943  0.73906  0.75523
 8        0.51945  0.60623  0.66693  0.71159  0.74565  0.77236  0.79378  0.81129
 9        0.53955  0.63285  0.69797  0.74560  0.78161  0.80958  0.83177  0.84970
10        0.55494  0.65311  0.72136  0.77092  0.80803  0.83654  0.85890  0.87675

λ = 4
ny \ nx      3        4        5        6        7        8        9        10
 3        0.29251  0.30367  0.31010  0.31427  0.31717  0.31930  0.32093  0.32222
 4        0.46558  0.51179  0.54104  0.56126  0.57608  0.58742  0.59637  0.60363
 5        0.56455  0.63665  0.68356  0.71649  0.74084  0.75955  0.77437  0.78638
 6        0.62143  0.70759  0.76314  0.80150  0.82932  0.85027  0.86652  0.87943
 7        0.65697  0.75074  0.81002  0.84993  0.87808  0.89866  0.91416  0.92613
 8        0.68090  0.77901  0.83976  0.87961  0.90692  0.92628  0.94042  0.95100
 9        0.69798  0.79871  0.85988  0.89907  0.92520  0.94321  0.95598  0.96525
10        0.71073  0.81311  0.87423  0.91256  0.93751  0.95427  0.96583  0.97399

λ = 5
ny \ nx      3        4        5        6        7        8        9        10
 3        0.39165  0.41270  0.42481  0.43266  0.43815  0.44219  0.44530  0.44776
 4        0.58713  0.64932  0.68814  0.71467  0.73394  0.74858  0.76007  0.76932
 5        0.68068  0.76196  0.81171  0.84479  0.86811  0.88527  0.89836  0.90860
 6        0.72975  0.81790  0.86956  0.90223  0.92409  0.93936  0.95041  0.95864
 7        0.75893  0.84940  0.90024  0.93086  0.95030  0.96318  0.97201  0.97824
 8        0.77800  0.86909  0.91845  0.94695  0.96423  0.97513  0.98225  0.98704
 9        0.79133  0.88238  0.93024  0.95690  0.97244  0.98184  0.98772  0.99150
10        0.80115  0.89188  0.93838  0.96351  0.97767  0.98594  0.99092  0.99400
Table 35. F-test power values for n = 3-10 and s-ratio λ = 0-1.
(Rows are ny; columns are nx; entries are power.)

λ = 0.0: power = 1.00000 for all combinations of ny = 3-10 and nx = 3-10.

λ = 0.2
ny \ nx      3        4        5        6        7        8        9        10
 3        0.39165  0.58713  0.68068  0.72975  0.75893  0.77800  0.79133  0.80115
 4        0.41270  0.64932  0.76196  0.81790  0.84940  0.86909  0.88238  0.89188
 5        0.42481  0.68814  0.81171  0.86956  0.90024  0.91845  0.93024  0.93838
 6        0.43266  0.71467  0.84479  0.90223  0.93086  0.94695  0.95690  0.96351
 7        0.43815  0.73394  0.86811  0.92409  0.95030  0.96423  0.97244  0.97767
 8        0.44219  0.74858  0.88527  0.93936  0.96318  0.97513  0.98184  0.98594
 9        0.44530  0.76007  0.89836  0.95041  0.97201  0.98225  0.98772  0.99092
10        0.44776  0.76932  0.90860  0.95864  0.97824  0.98704  0.99150  0.99400

λ = 0.4
ny \ nx      3        4        5        6        7        8        9        10
 3        0.14221  0.22806  0.29564  0.34398  0.37868  0.40429  0.42380  0.43906
 4        0.14250  0.24034  0.32488  0.38884  0.43614  0.47159  0.49879  0.52015
 5        0.14291  0.24808  0.34448  0.42028  0.47749  0.52079  0.55411  0.58029
 6        0.14332  0.25345  0.35863  0.44371  0.50889  0.55851  0.59674  0.62671
 7        0.14369  0.25739  0.36934  0.46187  0.53357  0.58837  0.63057  0.66355
 8        0.14399  0.26041  0.37772  0.47638  0.55351  0.61261  0.65804  0.69341
 9        0.14424  0.26278  0.38447  0.48825  0.56996  0.63266  0.68076  0.71805
10        0.14445  0.26470  0.39001  0.49813  0.58375  0.64952  0.69984  0.73868

λ = 0.6
ny \ nx      3        4        5        6        7        8        9        10
 3        0.07564  0.10273  0.12665  0.14614  0.16173  0.17425  0.18444  0.19283
 4        0.07283  0.10212  0.13003  0.15430  0.17470  0.19170  0.20593  0.21791
 5        0.07120  0.10174  0.13222  0.15988  0.18396  0.20461  0.22225  0.23736
 6        0.07022  0.10157  0.13386  0.16407  0.19107  0.21472  0.23528  0.25314
 7        0.06960  0.10153  0.13516  0.16736  0.19675  0.22292  0.24600  0.26628
 8        0.06919  0.10155  0.13622  0.17003  0.20139  0.22972  0.25499  0.27741
 9        0.06891  0.10161  0.13711  0.17223  0.20526  0.23545  0.26265  0.28698
10        0.06870  0.10168  0.13786  0.17409  0.20854  0.24035  0.26925  0.29529

λ = 0.8
ny \ nx      3        4        5        6        7        8        9        10
 3        0.05467  0.06163  0.06758  0.07248  0.07649  0.07980  0.08255  0.08487
 4        0.05202  0.05929  0.06587  0.07156  0.07642  0.08057  0.08412  0.08719
 5        0.05017  0.05755  0.06448  0.07067  0.07612  0.08090  0.08508  0.08875
 6        0.04883  0.05626  0.06340  0.06995  0.07584  0.08109  0.08577  0.08994
 7        0.04785  0.05529  0.06258  0.06938  0.07560  0.08124  0.08633  0.09092
 8        0.04709  0.05453  0.06193  0.06893  0.07541  0.08136  0.08680  0.09175
 9        0.04650  0.05393  0.06141  0.06856  0.07527  0.08148  0.08721  0.09248
10        0.04603  0.05345  0.06099  0.06827  0.07516  0.08159  0.08757  0.09312

λ = 1.0: power = 0.05000 for all combinations of ny = 3-10 and nx = 3-10.
Table 36. F-test power values for n = 5-100 and s-ratio λ = 1-3.
(Rows are ny; columns are nx; entries are power.)

λ = 1: power = 0.05 for all combinations of ny and nx in {5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100}.

λ = 2
ny \ nx      5        10       15       20       25       30       40       50       60       70       80       90       100
  5       0.21041  0.22926  0.23658  0.24043  0.24281  0.24442  0.24646  0.24770  0.24853  0.24913  0.24958  0.24993  0.25022
 10       0.38414  0.49639  0.55109  0.58353  0.60501  0.62027  0.64053  0.65336  0.66221  0.66869  0.67363  0.67753  0.68068
 15       0.45487  0.62152  0.70573  0.75560  0.78820  0.81099  0.84054  0.85870  0.87092  0.87969  0.88626  0.89137  0.89545
 20       0.49087  0.68548  0.78230  0.83747  0.87192  0.89495  0.92304  0.93906  0.94918  0.95606  0.96099  0.96468  0.96753
 25       0.51241  0.72299  0.82516  0.88085  0.91389  0.93485  0.95864  0.97099  0.97817  0.98272  0.98578  0.98795  0.98955
 30       0.52669  0.74730  0.85174  0.90637  0.93725  0.95585  0.97551  0.98476  0.98968  0.99256  0.99436  0.99556  0.99639
 40       0.54439  0.77664  0.88220  0.93379  0.96067  0.97548  0.98924  0.99462  0.99702  0.99821  0.99886  0.99923  0.99945
 50       0.55491  0.79358  0.89881  0.94770  0.97160  0.98387  0.99414  0.99757  0.99888  0.99943  0.99969  0.99982  0.99989
 60       0.56187  0.80456  0.90914  0.95588  0.97764  0.98820  0.99632  0.99869  0.99948  0.99977  0.99989  0.99995  0.99997
 70       0.56683  0.81224  0.91614  0.96120  0.98137  0.99073  0.99745  0.99921  0.99972  0.99989  0.99996  0.99998  0.99999
 80       0.57053  0.81791  0.92118  0.96490  0.98387  0.99235  0.99810  0.99947  0.99984  0.99994  0.99998  0.99999  1.00000
 90       0.57339  0.82226  0.92497  0.96762  0.98564  0.99345  0.99851  0.99962  0.99989  0.99997  0.99999  1.00000  1.00000
100       0.57568  0.82571  0.92793  0.96968  0.98696  0.99425  0.99879  0.99972  0.99993  0.99998  0.99999  1.00000  1.00000

λ = 3
ny \ nx      5        10       15       20       25       30       40       50       60       70       80       90       100
  5       0.47603  0.54899  0.57700  0.59187  0.60108  0.60736  0.61537  0.62026  0.62355  0.62593  0.62772  0.62911  0.63024
 10       0.72136  0.87675  0.92836  0.95158  0.96404  0.97154  0.97985  0.98420  0.98681  0.98853  0.98973  0.99062  0.99130
 15       0.78336  0.93786  0.97640  0.98918  0.99431  0.99669  0.99860  0.99928  0.99957  0.99972  0.99980  0.99985  0.99988
 20       0.80975  0.95808  0.98816  0.99597  0.99841  0.99930  0.99982  0.99994  0.99998  0.99999  0.99999  1.00000  1.00000
 25       0.82417  0.96743  0.99254  0.99797  0.99936  0.99977  0.99996  0.99999  1.00000  1.00000  1.00000  1.00000  1.00000
 30       0.83321  0.97267  0.99463  0.99877  0.99968  0.99990  0.99999  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 40       0.84390  0.97822  0.99654  0.99938  0.99987  0.99997  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 50       0.84999  0.98107  0.99738  0.99960  0.99993  0.99999  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 60       0.85393  0.98279  0.99783  0.99971  0.99996  0.99999  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 70       0.85668  0.98394  0.99812  0.99976  0.99997  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 80       0.85871  0.98476  0.99831  0.99980  0.99998  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
 90       0.86026  0.98537  0.99844  0.99983  0.99998  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
100       0.86150  0.98584  0.99855  0.99985  0.99998  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000

From these tables, it is obvious that the limiting factor in how well the F-test will be able to identify differences is the number of agency verification tests. The power of the F-test is limited not by the larger of the two sample sizes, but by the smaller. For example, in table 34, when ny = 3 and nx = 10, the power is only about 20 percent, even when there is a threefold difference in the true standard deviations (i.e., λ = 3). The limiting effect of the smaller sample size is also noticeable in table 36 for larger sample sizes. For example, for λ = 2 and nx = 100, the power when ny = 5 is only about 25 percent. The power increases to 68 percent for ny = 10, 90 percent for ny = 15, and 97 percent for ny = 20. Since the agency will have fewer verification tests than the contractor has tests, the agency's verification sampling and testing rate will determine the power to identify variability differences when they exist.

t-Test for Means: As with the appendix G method, the performance of the t-test for means can be evaluated with OC curves or by considering the average run length.

OC Curves: Suppose that we have two sets of measurements that are assumed to be from normally distributed populations and that we wish to conduct a two-sided or two-tailed test to see if these populations have equal means (i.e., μx = μy). Suppose that we assume that these two samples are from populations with unknown, but equal, variances. If these two samples are x1, x2, ..., xnx, with sample mean X and sample variance sx², and y1, y2, ..., yny, with sample mean Y and sample variance sy², we can calculate:

(10)

t = ( X - Y ) / √( [ sx²( nx - 1 ) + sy²( ny - 1 ) ] / ( nx + ny - 2 ) × ( 1/nx + 1/ny ) )

and accept H0: μx = μy for values of t in the interval [ -tα/2, nx+ny-2 , tα/2, nx+ny-2 ].
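
A minimal sketch of equation 10 with made-up data follows; scipy's two-sample t-test with the pooled-variance option performs the same comparison and is printed as a cross-check.

import numpy as np
from scipy import stats

# Hypothetical contractor (x) and agency (y) test results.
x = np.array([6.1, 5.8, 6.4, 6.0, 5.7, 6.3, 6.2, 5.9])
y = np.array([6.0, 6.2, 5.9, 6.1, 6.0, 6.3, 5.8, 6.2])

nx, ny = len(x), len(y)
pooled_var = (x.var(ddof=1) * (nx - 1) + y.var(ddof=1) * (ny - 1)) / (nx + ny - 2)
t_stat = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / nx + 1 / ny))  # equation 10
t_crit = stats.t.ppf(1 - 0.05 / 2, nx + ny - 2)
print(f"t = {t_stat:.3f}, fail to reject equal means: {abs(t_stat) <= t_crit}")

# Built-in equivalent (pooled variance, two-sided p-value).
print(stats.ttest_ind(x, y, equal_var=True))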

For this test, figure 49 or 50, depending on the α value, shows the probability that we have accepted the two samples as coming from populations with the same means. The horizontal axis scale is:

(11)

d = | μx - μy | / σ

where: σ = σx = σy = true common population standard deviation

We can access the OC curves in figure 49 or 50 with a value for d of d* and a value for n of n'

where:

(12)

n' = nx + ny - 1

and

(13)
d* = d × √[ ( nx × ny ) / ( n' × ( nx + ny ) ) ]

Example: Suppose that we have nx = 8 contractor tests and ny = 8 agency tests, conduct an α = 0.05 level test and accept that these two sets of tests represent populations with equal means. What power did our test really have to discern if the populations from which these two sets of tests came had different means? Suppose that we consider a difference in these population means of 2 or more standard deviations as a noteworthy difference that we would like to detect with high probability. This would indicate that we are interested in d = 2. Calculating

(14)

n' = nx + ny - 1 = 8 + 8 - 1 = 15

and

(15)

d* = d × √[ ( nx × ny ) / ( n' × ( nx + ny ) ) ] = 2 × √[ ( 8 × 8 ) / ( 15 × 16 ) ] = 1.0328

we find from figure 49 that β ≈ 0.05, so that our power of detecting a mean difference of 2 or more standard deviations would be approximately 95 percent.

Now suppose that we consider an application where we still have a total of 16 tests, but with nx = 12 contractor tests and ny = 4 agency tests. Suppose that we are again interested in the t-test performance in detecting a means difference of 2 standard deviations. Again, calculating

(16)

n' = nx + ny - 1 = 12 + 4 - 1 = 15

but now

(17)

d* = d × √[ ( nx × ny ) / ( n' × ( nx + ny ) ) ] = 2 × √[ ( 12 × 4 ) / ( 15 × 16 ) ] = 0.8944

we find from figure 49 that β ≈ 0.12, indicating that our power of detecting a mean difference of 2 or more standard deviations would be approximately 88 percent.
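
Both readings can be checked numerically: the exact power of the pooled two-sample t-test follows from the noncentral t distribution with nx + ny - 2 degrees of freedom and noncentrality parameter d√(nx ny / (nx + ny)). The sketch below is a direct calculation rather than the n' and d* chart approximation, but it lands close to the values read from the curves.

import numpy as np
from scipy import stats

def two_sample_t_power(d, nx, ny, alpha=0.05):
    """Two-sided power of the pooled t-test when the true mean difference is d sigma."""
    df = nx + ny - 2
    ncp = d * np.sqrt(nx * ny / (nx + ny))     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# d = 2, nx = ny = 8: power of about 0.96, i.e., beta near 0.04-0.05.
print(round(two_sample_t_power(2, 8, 8), 3))
# d = 2, nx = 12, ny = 4: power of roughly 0.89-0.90 (the curve reading gave about 0.88).
print(round(two_sample_t_power(2, 12, 4), 3))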

Figure 50 gives the appropriate OC curves for use in conducting an α = 0.01 level test on the means. This figure is accessed in the same manner as described above for figure 49.

Average Run Length: The effectiveness of the t-test procedure was evaluated by determining the average run length in terms of project lots. The evaluation was performed by simulating 1000 projects and determining, on average, how many lots it took to determine that there was a difference between the contractor's and the agency's population means.

The results of the simulation analyses are presented in table 37 for the case where five contractor tests and one agency test are performed on each project lot. Similar results were obtained for cases where fewer or more contractor tests were conducted per lot. As shown in table 37, when there is no difference between the population means, the run lengths are quite large (as they should be). The values with asterisks are biased on the low side because, to speed up the simulation, the maximum run length was limited to 100 lots; the actual average run lengths would be greater than those shown, since this cutoff was reached in more than half of the 1000 projects simulated for each i and j combination.

The average run lengths become relatively small as the actual difference between the contractor's and the agency's population means increases. This is obviously what is desired.

Table 37. Average run length results for the appendix H method (5 contractor tests and 1 agency test per lot) for 1000 simulated lots.
Mean Difference,      Contractor's σ ÷      Run Length
units of agency's σ   Agency's σ            Average     Std. Dev.
0                     0.5                   55.47*      46.01*
                      1.0                   70.15*      41.91*
                      1.5                   77.78*      36.95*
                      2.0                   75.72*      38.56*
1                     0.5                    4.83        4.05
                      1.0                    5.75        4.28
                      1.5                    8.63        5.70
                      2.0                    9.83        5.94
2                     0.5                    2.60        1.18
                      1.0                    2.64        1.02
                      1.5                    3.51        1.52
                      2.0                    4.40        2.03
3                     0.5                    2.35        0.73
                      1.0                    2.10        0.37
                      1.5                    2.36        0.66
                      2.0                    2.88        1.03

*These values are lower than the actual values. To reduce the simulation processing time, the maximum number of lots was limited to 100. For these cases, more than half of the projects were truncated at 100 lots.

CONCLUSIONS AND RECOMMENDATIONS

Based on the analyses that were conducted and were summarized in this chapter, the following recommendations were made:

Recommendation for Test Method Verification

The comparison of a single split sample by using the maximum allowable limits (such as the D2S limits) is simple and can be done for each split sample that is obtained. However, since it is based on comparing only single data values, it is not very powerful for identifying differences where they exist. It is recommended that each individual split sample be compared using the maximum allowable limits, but that the paired t-test also be used on the accumulated split-sample results to allow for a comparison with more discerning power. If either of these comparisons indicates a difference, then an investigation to identify the cause of the difference should be initiated.

Recommendation for Process Verification

Since they are both based on five contractor tests and one agency test per lot, the results in tables 33 and 37 can be used to compare the appendix H and appendix G methods. The average run lengths for the appendix H method (t-test) were better than those for the appendix G method (single agency test compared to five contractor tests). Compared to the appendix G method, the appendix H method had longer average run lengths where there was no difference in the means and shorter lengths where there was a difference in the means. This is what is desirable in the verification procedure. The appendix H method is recommended for use in verifying the contractor's test results when the agency obtains independent samples for evaluating the total process.

From the OC curves that were developed, it is apparent that the number of agency verification tests will be the deciding factor when determining the validity of the contractor's overall process. When using the OC curves in figure 49 or 50, the lower the value of d*, the lower the power of the test for a given number of test results. The value for d* will decrease as the agency's portion of the total number of tests declines (this is shown in equation 13). If, in the expression under the square root sign, the total number of tests (nx + ny) is fixed, then the value of d* will decrease as the value of either nx or ny goes down.

An example will illustrate this point. Suppose that the total nx + ny is fixed at 16; then the value under the square root sign is maximized when nx = ny = 8. This is true because the denominator is fixed and 8 × 8 = 64 is larger than the product of any other pair of whole numbers that total 16. As one of the values gets smaller (and the other gets correspondingly larger), the product of the two numbers decreases, thereby decreasing d* and reducing the power of the test.
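
The arithmetic is easy to confirm numerically; the short sketch below evaluates the multiplier applied to d in equation 13 for several splits of 16 total tests and shows that it is largest for the balanced split of 8 and 8.

import math

total = 16                       # fixed total number of tests, as in the example
n_prime = total - 1              # n' = nx + ny - 1 (equation 12)
for nx in (8, 10, 12, 14):
    ny = total - nx
    factor = math.sqrt(nx * ny / (n_prime * total))   # d*/d from equation 13
    print(f"nx = {nx:2d}, ny = {ny:2d}, d*/d = {factor:.4f}")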

The amount of verification sampling and testing is a subjective decision for each individual agency. However, with the OC (or power) curves and tables in this chapter, an agency can determine the risks that are associated with any frequency of verification testing and can make an informed decision regarding this testing frequency.

When using the appendix H method, first, an F-test is used to determine whether or not the variances (and, hence, standard deviations) are different for the two populations. The result of the F-test determines how the subsequent t-test is conducted to compare the averages of the contractor's and the agency's test results. Given some of the low powers associated with small sample sizes in tables 34 through 36, it could be argued that an agency will rarely be able to conclude from the F-test that a difference in variances exists. Given this fact, it may be reasonable to just assume that the populations have equal variances and run the t-test for equal variances and ignore the F-test altogether. This argument has some merit. However, with the ease of conducting the F-test and the t-test by computer, once the test results are input, there is essentially no additional effort associated with conducting the F-test before the t-test.
