A Review of Item Exposure Control Procedures for Enhancing Test Security

P. Adam Kelly
Ph.D. Student
Educational Research - Measurement & Statistics
Florida State University

February 1999


The issue of item exposure control has received increased attention over the last several years, due in large part to the ongoing large-scale commercial implementation of computer adaptive test (CAT) programs. In late 1998, ETS inaugurated its full-scale CAT-TOEFL exam worldwide (albeit with some degree of difficulty), and during the previous year rolled out its CAT version of the GRE General test at testing centers in the United States. A CAT version of the SAT is expected to be ready for commercial use sometime during the next several years. Behind these roll-outs are extensive simulation studies of the performance characteristics of CATs versus traditional pencil-and-paper tests, dating back as far as 1990 (Eignor et al., 1993). While other testing programs have yet to put CATs into full commercial operation, ACT is currently preparing a full CAT version of the MCAT, and the LSAC expects to begin CAT-LSAT testing in 1999.

Now that CAT appears to be "here," issues like test security and, in particular, item exposure become all the more critical. Stocking (1993) points out that the continuous nature of CAT administration, as compared to the periodic nature of traditional linear test administration, means that test items may be seen, and possibly remembered, more frequently by examinees, leading to a security compromise of the items. Mills & Stocking (1996) name uncontrolled item exposure as one of the greatest threats to test validity as CATs come off the drawing board and into operation. Davey & Parshall (1995) and Parshall, Davey & Nering (1998) identify three primary, often conflicting goals of CAT:

    1. to maximize test efficiency by measuring examinees as quickly and as accurately as possible;
    2. to assure that the test measures the same composite of multiple traits for each examinee by balancing the rates at which items with different content properties are administered; and
    3. to protect the security of the item pool by controlling the rates at which popular items can be administered.

It is the last of these goals that item exposure control addresses. As Stocking & Lewis (1995), Stocking & Lewis (1998), and Davey & Parshall (1995) all indicate, the items in an item bank that provide the most information about an examinee's ability are invariably those with the best discrimination at the examinee's ability level. These items are the "most valuable players" or, as Nering, Thompson & Davey (1998) say, the "thoroughbreds" of the item bank, and as such they are called upon more frequently than other items because they perform so well. Using these items helps achieve goals 1 and 2 above but works against goal 3: they carry the greatest risk of overexposure because they "overlap," appearing on so many CATs.

Item selection algorithms, of which there are many, operate strictly on "best fit" considerations (e.g., maximization of item and test information, content validity, appropriate dimensionality) without regard to how many times an item has been administered before. Item exposure control is therefore a process that "overrides the optimal selection of the next item" in favor of administration-frequency concerns. Stocking (1994) explains how working with a small item pool can seriously undermine test security in CAT, especially when item selection is not tempered by exposure control.
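
To make this baseline concrete, the following minimal Python sketch implements pure "best fit" selection under a two-parameter logistic (2PL) model, always choosing the most informative available item. The pool representation, parameter names, and helper functions are illustrative assumptions rather than any operational program; the later sketches in this review build on them.

    import math

    def p_correct(theta, a, b):
        # 2PL probability of a correct response; a = discrimination, b = difficulty
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def information(theta, a, b):
        # Fisher information of a 2PL item at ability theta
        p = p_correct(theta, a, b)
        return a * a * p * (1.0 - p)

    def select_best_item(theta, pool, administered):
        # Pure "best fit" selection: maximize information at the current
        # ability estimate, ignoring how often each item has been used.
        # pool maps an item id j to its parameter pair (a, b).
        available = [j for j in pool if j not in administered]
        return max(available, key=lambda j: information(theta, *pool[j]))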

The earliest known item exposure control procedure, reported by McBride & Martin (1983), randomizes item selection only at the start of a test: the first item administered is chosen at random from the five (independently selected) "best" items, the second from the four "remaining best" items, and so on, until the fifth and all subsequent items, for which the single best available item is administered. No exposure protection is therefore provided beyond the fourth item in the test. The "4-3-2-1" procedure (McBride & Martin, 1983) extends exposure control to all items on a test and improves slightly on equal-odds administration: at any given point in the test it selects the four "best" items, in rank order, and assigns them administration probabilities of 40%, 30%, 20%, and 10%, respectively. Unfortunately, this still does not provide adequate exposure control for the most "popular" items in the item bank, especially at the extremes of difficulty, since all items are treated equally a priori as available for administration.
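
The 4-3-2-1 rule reduces to a single weighted draw. The sketch below reuses the illustrative information() helper defined above; it is a sketch of the idea, not the authors' implementation.

    import random

    def select_4321(theta, pool, administered, rng=random):
        # Rank the four most informative available items and administer one
        # of them with probabilities .4, .3, .2, and .1, respectively.
        ranked = sorted(
            (j for j in pool if j not in administered),
            key=lambda j: information(theta, *pool[j]),
            reverse=True,
        )[:4]
        weights = [0.4, 0.3, 0.2, 0.1][:len(ranked)]
        return rng.choices(ranked, weights=weights, k=1)[0]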

Sympson & Hetter (1985) presented a procedure for fully separating within-test item administration from overall item exposure by assigning a new parameter, called the "exposure control parameter," to each item in an item bank. This parameter, k_j, ranging from 0 to 1, represents the probability that item j is administered on a test given that it has been selected. k_j is assigned to item j with reference to a user-defined maximum acceptable overall probability, r_j, of item j being administered. Whenever item j is selected, a random number from 0 to 1 is generated, and only if that random number is less than k_j is item j actually administered on that test. The values of k_j are set empirically, from repeated simulations of the CAT, so that the overall probability of each item's use does not exceed its r_j (Thissen & Mislevy, 1990).
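
The administration step amounts to a probabilistic filter on top of "best fit" selection. The sketch below, which reuses the helpers defined above, assumes the k_j values have already been calibrated by simulation, and it treats a rejected item as unavailable for the remainder of that examinee's test, which is how the procedure is usually described:

    def administer_sympson_hetter(theta, pool, administered, blocked, k,
                                  rng=random):
        # Select the most informative remaining item, then administer it
        # only if a uniform draw falls below its exposure control
        # parameter k[j]; rejected items stay in `blocked` for this test.
        while True:
            available = [j for j in pool
                         if j not in administered and j not in blocked]
            j = max(available, key=lambda i: information(theta, *pool[i]))
            if len(available) == 1 or rng.random() < k[j]:
                return j   # the last remaining item passes unconditionally
            blocked.add(j)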

Stocking & Lewis (1995) couple the Sympson-Hetter procedure with a new item selection algorithm. This procedure, which Stocking & Lewis name the "multinomial method," includes an exposure control parameter adjustment phase analogous to the Sympson & Hetter (1985) procedure, plus an item selection process that enhances item bank usage. The Stocking & Lewis (1995) selection phase proceeds as follows:

    1. a set of items is selected without regard to exposure;
    2. the cumulative, or "multinomial," probability distribution is calculated over this set of items;
    3. a random number between 0 and 1 is generated and matched on the multinomial distribution to an item on the list, which is administered; and
    4. all items below that item on the multinomial distribution are barred from availability for the rest of the test.
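
One plausible reading of that selection phase is sketched below, again on top of the earlier helpers. The candidate-list size, the use of the exposure control parameters as multinomial weights, and the rule for which items are barred are assumptions made for illustration; Stocking & Lewis (1995) should be consulted for the exact procedure.

    def select_multinomial(theta, pool, administered, barred, k, rng=random):
        # Order candidates from most to least informative, weight each by
        # its exposure control parameter, and draw one item from the
        # resulting cumulative ("multinomial") distribution.
        candidates = sorted(
            (j for j in pool if j not in administered and j not in barred),
            key=lambda j: information(theta, *pool[j]),
            reverse=True,
        )[:10]                               # candidate-list size is illustrative
        u = rng.random() * sum(k[j] for j in candidates)
        cum = 0.0
        for rank, j in enumerate(candidates):
            cum += k[j]
            if u <= cum:
                barred.update(candidates[:rank])  # passed-over items are barred
                return j
        return candidates[-1]                # guards against rounding error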

While both the Sympson-Hetter procedure and the Stocking & Lewis (1995) enhancement of it allow item availability to be differentiated across the pool, neither procedure protects items at the extremes of the difficulty range, such as the "thoroughbreds." For example, in one simulation study, the exposure rate of a highly difficult item, across several thousand administrations to simulated examinees ("simulees"), was 20% overall but approached 100% among the high-ability simulees alone (Stocking & Lewis, 1998).

In response to this shortcoming of the so-called "unconditional" Sympson & Hetter procedure, two extensions were introduced at nearly the same time by Davey & Parshall (1995) and Stocking & Lewis (1995). Both the Davey-Parshall and Stocking-Lewis procedures incorporate, for the first time, conditionality in item exposure control: a given item is assigned a vector of exposure control parameters, one for each of several conditioning levels, rather than a single parameter. The Davey-Parshall procedure conditions item exposure on the items that have appeared previously in a test, while the Stocking-Lewis procedure, dubbed the "conditional Sympson-Hetter" by Parshall, Davey, & Nering (1998), conditions on the examinee's ability as estimated at that point in the test. Another "conditional Sympson-Hetter" procedure, similar to the Stocking-Lewis, is presented by Thomasson (1995); it also conditions item exposure on examinee ability but uses a selection algorithm different from the multinomial method. Unfortunately, a presentation of this procedure is not currently published or otherwise accessible.
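
In code, the only change from the unconditional sketch above is that the exposure control parameters are indexed by a conditioning level as well as by item. The sketch below conditions on a coarse ability stratum, in the spirit of the Stocking-Lewis variant; the number and placement of the cut points are illustrative assumptions.

    def ability_stratum(theta, cuts=(-1.5, -0.5, 0.5, 1.5)):
        # Map the provisional ability estimate onto a discrete level;
        # these cut points are illustrative, not from any published pool.
        return sum(theta > c for c in cuts)

    def administer_conditional_sh(theta, pool, administered, blocked,
                                  k_cond, rng=random):
        # Conditional Sympson-Hetter filter: k_cond[j][g] is the exposure
        # control parameter of item j at ability level g, so a
        # "thoroughbred" can be throttled exactly where it is most in demand.
        g = ability_stratum(theta)
        while True:
            available = [j for j in pool
                         if j not in administered and j not in blocked]
            j = max(available, key=lambda i: information(theta, *pool[i]))
            if len(available) == 1 or rng.random() < k_cond[j][g]:
                return j
            blocked.add(j)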

The latest innovation in conditional item exposure control, by Nering, Thompson & Davey (1998), is a "hybrid" of the Davey-Parshall and Stocking-Lewis procedures that conditions on both previous item appearance and point-estimated examinee ability. This newest procedure has worked well in simulation studies, and its use appears to incur little marginal cost in programming and computational capacity compared to either the Davey-Parshall or the Stocking-Lewis procedure alone. ACT claims to be satisfied with the performance of the Nering-Thompson-Davey hybrid, and may implement it in a CAT version of its tests. ETS currently uses the Stocking-Lewis procedure for its CAT-GRE, and a simpler randomization approach, similar to the 4-3-2-1 procedure, for its prototype CAT-SAT (Eignor et al., 1993).


References

Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.

Eignor, D. R., et al. (1993). Case studies in computer adaptive test design through simulation. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Atlanta, GA.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New Horizons in Testing (pp. 223-236). New York: Academic Press.

Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9(4), 287-304.

Nering, M. L., Thompson, T., & Davey, T. (1998). Controlling item exposure and maintaining item security. Paper presented at the Annual Meeting of the Psychometric Society, Gatlinburg, TN.

Parshall, C. G., Davey, T., & Nering, M. L. (1998). Test development exposure control for adaptive testing. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Diego, CA.

Stocking, M. L. (1993). Controlling item exposure rates in a realistic adaptive testing paradigm. (Research Report 93-xx). Princeton, NJ: Educational Testing Service.

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools. (Research Report 94-xx). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing. (Research Report 95-25). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23(1), 57-75.

Thissen, D., & Mislevy, R. J. (1990). Testing algorithms. In H. Wainer, N. J. Dorans, R. Flaugher, B. F. Green, R. J. Mislevy, L. Steinberg, & D. Thissen, Computerized Adaptive Testing: A Primer (pp. 103-135). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thomasson, G. L. (1995). New item exposure control algorithms for computerized adaptive testing. Paper presented at the Annual Meeting of the Psychometric Society, Minneapolis, MN.