Approaches to Evaluating Welfare Reform:
Lessons from Five State Demonstrations

Chapter 6:
Analysis Methods

A major goal of a welfare reform evaluation is to analyze the reform's impacts, an analysis that must take several issues into account. This chapter addresses four analytic issues for welfare reform evaluations:

  1. What efforts, if any, should be made to distinguish impacts of specific welfare policy changes?
  2. What efforts, if any, should be made to estimate entry effects arising from welfare reform?
  3. How should crossover cases be treated for analysis purposes?
  4. Should impacts be estimated for subgroups that are defined by events that occur after random assignment? If so, how?

A. DISTINGUISHING IMPACTS OF SPECIFIC POLICY CHANGES

State welfare reform demonstrations usually include changes in several policies. For example, in recent years, many states have combined work requirements for welfare recipients with measures (such as more generous earned income deductions and higher asset limits) to make work pay. Policymakers and evaluators may want to distinguish impacts of specific welfare policy changes. Such analyses could identify which components of a welfare reform package are most effective in achieving particular goals.

The welfare reform waiver system has recognized the importance of distinguishing impacts of specific policy changes. When multiple welfare reform policies have been implemented under a single welfare reform waiver demonstration, the terms and conditions of Section 1115 welfare waivers have required the evaluator to "discuss the feasibility of evaluating the impact of individual provisions" of the total package. When multiple welfare reform policies have been implemented under separate welfare reform waiver demonstrations, the terms and conditions of Section 1115 welfare waivers have stated that "possible confounding effects from other demonstrations ... must be addressed in detail."

1. Issues

This section focuses on distinguishing impacts from particular welfare policy changes (as opposed to the whole package of changes), either in a random-assignment evaluation or in a nonexperimental one. The two main issues addressed are:

  1. How can impacts of particular policy changes (as opposed to the whole package) be measured using experimental methods?
  2. How can impacts of particular policy changes be measured using nonexperimental methods?

a. Distinguishing Impacts for Policy Changes Using Experimental Methods

The most rigorous way to distinguish impacts for specific policy changes is to employ an evaluation design with random assignment to multiple experimental groups. If there are several experimental groups, each exposed to different sets of policies, and a control group exposed to pre-reform policies, then the impacts of each set of policies can be measured and compared. Without such a design, the direction and relative size of impacts from two sets of policy changes can sometimes be inferred from the impact of both sets together.

Unlike an evaluation design with random assignment to a single experimental group, an evaluation design with random assignment to multiple experimental groups allows impacts to be estimated for multiple sets of policy changes, even if these changes were implemented simultaneously. For example, a welfare reform package may include both expanded earned income deductions and work requirements. Each set of provisions is likely to have positive impacts on employment rates. If an evaluation design included only one experimental group (X1), subject to both provisions, and a control group (N1), subject to neither provision, it would be impossible to distinguish separate impacts on employment. In contrast, if an evaluation design also included two partial experimental groups, one (X2) subject to the earnings incentives but not the work requirements, and the other (X3) subject to the work requirements but not the earnings incentives, the following impacts could be measured:

  1. (X1 - N1): the impact of the combined earnings incentives and work requirements
  2. (X2 - N1): the impact of the earnings incentives alone
  3. (X3 - N1): the impact of the work requirements alone
  4. (X1 - X2): the incremental impact of the work requirements, given the earnings incentives
  5. (X1 - X3): the incremental impact of the earnings incentives, given the work requirements
  6. (X2 - X3): the difference between the impacts of the earnings incentives and the work requirements

Moving from a two-group experimental design to a four-group experimental design thus multiplies by a factor of six the number of impact estimates that can be obtained from the evaluation. If a three-group experimental design were used (for example, groups N1, X1, and X2), then three impact estimates could be obtained from the evaluation: (X1 - N1), (X2 - N1), and (X1 - X2). However, it is helpful to be able to estimate (X2 - N1) and (X3 - N1) separately, since, because of interaction effects of earnings incentives and work requirements, it will not necessarily be the case that (X1 - N1) = (X2 - N1) + (X3 - N1).
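To make the arithmetic of these contrasts concrete, the following is a minimal sketch (in Python, with the group labels from the text and simulated outcomes) of how difference-in-means impact estimates could be computed for every pair of groups in a four-group design:

```python
# A sketch of pairwise impact estimates in a four-group design.
# Group labels follow the text (N1 = control, X1 = both provisions,
# X2 = incentives only, X3 = requirements only); the employment
# outcomes are simulated for illustration only.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

groups = {
    "X1": rng.binomial(1, 0.42, size=1000),  # 1 = employed
    "X2": rng.binomial(1, 0.36, size=1000),
    "X3": rng.binomial(1, 0.35, size=1000),
    "N1": rng.binomial(1, 0.30, size=1000),
}

# Each unordered pair of groups yields one contrast, so four groups
# support C(4, 2) = 6 difference-in-means impact estimates.
for a, b in combinations(groups, 2):
    print(f"({a} - {b}): {groups[a].mean() - groups[b].mean():+.3f}")
```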

When an experimental evaluation design includes only a single experimental group subject to all welfare reform policies, and a control group subject to none, distinguishing the impacts of separate policy changes is more difficult. Sometimes, even when two sets of policies are implemented simultaneously, there are theoretical grounds for attributing opposite signs (directions) to the impacts of each set. In this case, the sign of the impact estimate indicates which set of policies has the larger impact. For example, expanded earnings incentives are likely to increase welfare participation levels by broadening eligibility, while stricter work requirements are likely to decrease welfare participation levels by reducing leisure or by imposing financial sanctions for noncompliance. If the impact of the combined changes on welfare participation were positive, then the positive impact of expanded earnings incentives would have to be larger in magnitude than the negative impact of stricter work requirements.

In contrast, whenever the anticipated impacts from multiple policy changes are in the same direction, it is impossible to distinguish the impacts of specific changes with only one experimental group. For example, since expanded earnings incentives and stricter work requirements both are likely to lead to higher employment rates, the evaluator cannot assess the contribution of each policy change to the package's overall impact on employment, even with knowledge of the combined impact of these two provisions.

When welfare reform policies are implemented sequentially rather than simultaneously (usually for programmatic rather than evaluation reasons), additional opportunities may be introduced to infer impacts for separate policies in a two-group experimental design. For example, if expanded earnings incentives are implemented immediately, but stricter work requirements are added after 24 months, the first two years of estimated impacts can be attributed to the expanded earnings incentives alone.

Although the staggered implementation of welfare reform policies provides opportunities for inferring impacts of particular changes, care must be taken in determining the groups compared after the implementation of a second set of reforms. Welfare cases assigned to experimental or control groups before the second stage of implementation are likely to be affected by their initial exposure to only the first set of welfare reform policies. Evaluators should distinguish impacts on cases with staggered exposure to the two welfare reform packages from impacts on cases exposed to the packages in combination only.

b. Distinguishing Impacts Using Nonexperimental Methods

Regardless of whether the underlying evaluation design includes random assignment, evaluators can use several nonexperimental methods to attempt to assess the impacts of different welfare reform policies. First, a process study can often identify the components of a program likely to have been most (or least) effective through interviews with program staff and with clients; for instance, such interviews can identify components that were never implemented or that were misunderstood, versus components that were implemented well. Second, impacts of particular provisions of a welfare reform package may be analyzed by comparing outcomes for cases that participate in those components of the overall package (such as a JOBS program) to outcomes for cases that do not participate. Third, the staggered implementation of welfare reform policies in particular sites can help to distinguish impacts of different measures. Fourth, the evaluation design may call for certain research sites to implement only a subset of the total welfare reform package; this allows separate impacts to be estimated in a way similar to the use of a partial experimental group in an experimental design.

When the evaluation design does not incorporate random assignment, nonexperimental methods must be used. Unfortunately, these approaches are less likely than experimental methods to lead to reliable estimates of the impacts of different welfare reform provisions. The main disadvantage of nonexperimental approaches is that the groups being compared most likely differ not only by being subject to different policies, but also in other ways. For example, cases that decide to participate in a program are probably systematically different from cases that decide not to participate.(1) Similarly, when welfare reform policies are implemented in stages, applicants subject to both the first and second stage of reforms probably differ in systematic ways from cases initially subject only to the first stage of reforms. Finally, when different research sites implement different combinations of welfare reform policies, systematic differences probably exist between the sites that are confounded with the impacts of the policy combinations. Although statistical procedures such as multivariate regression can adjust for observed differences between different groups of cases, only random assignment can ensure that the unobserved characteristics of different groups of cases are, on average, the same.

Nonexperimental analysis of component impacts may be most useful as a supplement to experimental estimates of the impacts of the entire reform package. In this situation, the experimental design can be relied upon to determine whether a welfare reform package is associated with statistically significant differences in outcomes. Nonexperimental analyses (particularly process analyses) can help to establish which policy changes appear to be most responsible for the observed impacts of the entire package.

2. State Approaches

Of the five waiver evaluations reviewed, only one--Minnesota's MFIP evaluation--included multiple experimental groups. In the three urban counties participating in the MFIP demonstration, the research sample included two experimental and up to two control groups. The full MFIP experimental group (E1) received both the financial incentives and the case management provisions of the welfare reform package. The partial MFIP experimental group (E2) received the financial incentives portion of the welfare reform package but continued to receive JOBS (job-training) services under the pre-welfare reform rules. The AFDC + JOBS control group (C1) was subject to the full set of control policies, while the AFDC-only control group (C2) was not eligible for JOBS services. By comparing differences between these groups, it is possible to distinguish the impact of the full welfare reform package (E1 - C1) from the impact of the case management portion of the welfare reform package (E1 - E2), the impact of the financial incentives portion of the welfare reform package (E2 - C1), and the impact of current JOBS services (C1 - C2).

In two other states--California and Michigan--a two-group random-assignment design was originally adopted to study impacts from an initial set of welfare reform waivers and was subsequently used to study impacts from a combination of two waiver packages. In California, the APDP was implemented in December 1992 and the WPDP in March 1994. A two-group random-assignment design was adopted, with experimental cases subject to whatever reform policies had been implemented and control cases subject to neither set of welfare reform policies. Random assignment of applicant cases was scheduled to continue through December 1996. Presumably, cases that went through random assignment before March 1994 could be studied for up to 15 months to infer impacts from the APDP alone, while cases that went through random assignment after March 1994 could be studied to infer impacts from the combination of the APDP and WPDP. For cases that went through random assignment before March 1994, impacts measured after March 1994 would need to be attributed to the combination of the APDP and the subsequently implemented WPDP.

In Michigan, the first set of TSMF provisions was implemented in October 1992, and an additional set of provisions approved under a second waiver was implemented in October 1994. As in California, the evaluation sample consists of a single experimental group and a control group, with the experimental group subject to all welfare reform policies implemented to date. Random assignment of applicants was scheduled to continue until October 1996. The evaluator is planning to distinguish impacts for cases that applied for assistance before October 1994 from impacts for those that applied after October 1994. The evaluator has no plans to compare the impacts of the first package with the impacts of the combination of the two waiver packages, because the characteristics of applicants were different between 1992 and 1994. In addition, there are no systematic plans for distinguishing the impacts of separate waiver provisions within each major reform package, although the timing of particular provisions might allow some inferences to be made. For instance, for recipient cases, work requirements were not implemented until April 1993, but the first impacts the evaluator reported for this group were measured as of October 1993, after the work requirements had already been implemented.

Colorado's CPREP program includes a variety of welfare reform provisions in a single package; CPREP is being evaluated using a two-group experimental design. Currently, no efforts are under way to estimate separate impacts of the different provisions of this package.

In Wisconsin, the evaluator proposed distinguishing impacts of individual components by comparing participants in those components. As noted earlier, impacts of individual components estimated through these nonexperimental methods are likely to be less reliable than impacts estimated through experimental methods, because there is no guarantee that the underlying comparisons are between equivalent groups of cases.

All five of the evaluations we reviewed include process studies based in part on interviews with program staff and clients on their experiences with welfare reform. These interviews and related analyses will not enable evaluators to attach numerical values to impacts from separate provisions of welfare reform packages; however, they may help to identify particular provisions of each package that are strongly associated with particular outcomes from welfare reform.

3. Analysis and Recommendations

We recommend that states that want to estimate impacts for separate components of a welfare reform package consider evaluation designs with multiple experimental groups. Such designs (Minnesota's four-group MFIP design is an example) can be more informative to policymakers than the standard two-group experimental design. The major disadvantages of multigroup designs are that they require a larger research sample to achieve the same precision standards as two-group designs and that a state must administer three or more programs simultaneously in the research sites. To reduce the burdens of maintaining a four-group design, states may want to consider adopting a three-group experimental design, defining two experimental groups--a full experimental group subject to all of the welfare reform provisions and a partial experimental group subject to a subset of the welfare reform provisions--in addition to a control group. The policy changes from which the partial experimental group would be exempt would depend on the interests of the state but might include components of the proposed welfare reform package that are especially controversial or untested.

When states introduce a new welfare reform package after an evaluation of an earlier initiative has begun, we recommend that a second research sample be created, if possible; this would preserve the integrity of the research sample used to study the first initiative. The second sample would consist of recipient and applicant cases that are randomly assigned to either the earlier package only or to the combination of policies contained under both packages. Creation of a second research sample would require state officials to administer welfare under three different regimes, but it would make it much easier to distinguish impacts of the first and second set of welfare reform packages, for both recipients and applicants, in both the short and the long term.

If more than a two-group experimental design is not possible, we recommend that evaluators not attempt to estimate impacts for specific welfare reform provisions within the overall package. Our investigation of welfare reform waiver evaluations found no evidence that separate impacts for different welfare reform provisions can be distinguished reliably in the absence of a design with multiple experimental groups. Instead, we recommend that evaluators confine their analysis of separate welfare reform provisions to qualitative inferences obtained on theoretical grounds or through a process study that includes interviews with program staff and welfare recipients.

B. ESTIMATING ENTRY EFFECTS

Even in an experimental evaluation in which random assignment is implemented properly, the validity of impact estimates might be questioned if the adoption of welfare reform has induced substantial entry effects. Entry effects arise when the adoption of a welfare reform package either encourages or discourages applications for welfare, thereby changing the composition of the population of welfare applicants. For example, a welfare reform initiative that expands job-training programs might encourage applications for welfare, while a welfare reform initiative with stringent time limits might discourage applications. Entry effects do not bias impact estimates for the population that applies for assistance following the implementation of welfare reform. Nonetheless, when entry effects are present, impact estimates may not be valid for the population that would have applied for assistance in the absence of welfare reform. The terms and conditions of current Section 1115 welfare waivers state, "The evaluation contractor will explain how entry effects can be determined and will describe the methodology which will be employed to determine the entry effects of" the welfare reform program.

1. Issues

This section considers two issues related to the estimation of entry effects:

  1. Should efforts be made to estimate entry effects arising from welfare reform? If so, what data are needed and how should they be analyzed?
  2. Does the likelihood of entry effects call into question the validity of impact estimates from a random-assignment evaluation?

a. Feasibility of Estimating Entry Effects

One way to infer the direction of possible entry effects is to examine the impact of welfare reform on the exit behavior of recipient cases. As Moffitt (1993) noted, when welfare reform changes the benefits and potential earnings of welfare recipients and applicants, exit effects (effects on the probability of exiting welfare for recipient cases) are likely to be of the opposite sign as entry effects:

The conventional, "static" theory suggests that potential applicants as well as recipients continually compare two variables in making decisions to apply or exit: potential earnings in the private labor market, and the welfare benefit. Empirical research has strongly confirmed this theory, for welfare benefits and potential earnings have been shown repeatedly to have strong positive and negative effects, respectively, on the probability of being on AFDC at a point in time and on the probability of entering the rolls; and the probability of exiting the rolls has been shown to be negatively affected by benefits and positively affected by potential earnings.

For example, the imposition of time limits would tend to lower the expected value of welfare benefits, leading both to higher rates of exits by welfare recipients and to lower rates of application for welfare.(2)

Detecting exit effects in a particular direction may help to infer the direction of entry effects, but obtaining estimates of the size of entry effects requires the analysis of time-series data on applications to a state's welfare program. For example, if the data and analytic resources were available, monthly levels of applications could be studied over a multiyear period, adjusting for time-varying factors such as local unemployment rates, population changes, and the implementation of new policies (such as expansions of eligibility for welfare). To calculate application rates, applications could be compared to the population of potential welfare applicants, which could be estimated (at least in large states) from household survey data. Entry effects would be measured as the extent to which adjusted rates of application differ following the implementation of a welfare reform package. Exit effects could also be measured using aggregate time-series data on the size of the caseload of ongoing welfare recipients. Unfortunately, as with most estimates obtained from nonexperimental analyses, these estimates of entry and exit effects would probably be somewhat sensitive to the control variables included and the statistical assumptions underlying an evaluator's model of applicant behavior.
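As one illustration of this kind of analysis, the following sketch fits a simple monthly time-series regression on simulated data; the variable names, the reform date, and the effect sizes are all hypothetical, not estimates from any actual state:

```python
# A sketch of a time-series model of application rates. The entry
# effect is read off the coefficient on the post-reform indicator,
# after adjusting for trend and local unemployment. All data simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_months = 120                           # ten years of monthly data
t = np.arange(n_months)
post_reform = (t >= 84).astype(float)    # hypothetical reform in month 84
unemployment = 6 + np.sin(t / 12) + rng.normal(0, 0.3, n_months)

# Applications per 1,000 at-risk households: trend, business cycle,
# a simulated drop in entries after reform, and noise.
applications = (20 + 0.01 * t + 1.5 * unemployment
                - 2.0 * post_reform + rng.normal(0, 1.0, n_months))

X = sm.add_constant(np.column_stack([t, unemployment, post_reform]))
fit = sm.OLS(applications, X).fit(cov_type="HAC", cov_kwds={"maxlags": 12})

# Coefficient on post_reform = estimated entry effect (with a
# serial-correlation-robust standard error).
print(fit.params[3], fit.bse[3])
```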

Moffitt (1992) proposed an experimental approach for measuring entry effects. This approach would involve randomly assigning the welfare reform and control policies to a large number of different sites, then comparing entry and exit rates across the sites. Moffitt notes that such an approach has many practical problems, including the difficulty of obtaining enough sites, the problem of cross-site migration, and the challenge of maintaining stable policies in each site for more than a very limited period of time. He concludes that nonexperimental approaches using administrative data are a more feasible way to obtain estimates of entry effects.

b. Implications of Entry Effects for Results of a Random-Assignment Evaluation

If entry effects are detected, the validity of a random-assignment evaluation for the sample of applicants may be questioned. When entry effects are present, the number and characteristics of applicants are different than they would have been in the absence of welfare reform. As a result, impact estimates from the sample of applicants do not necessarily apply to the cases that would have applied for assistance had welfare reform not been implemented in the research sites. Nonetheless, as noted previously, impact estimates would remain unbiased for the population of actual applicants. If entry effects are small, impact estimates for applicants may still provide a good indication of the effects of the experimental policies on cases that would have applied for assistance in the absence of welfare reform.

Entry effects do not call into question a random-assignment evaluation's results for recipient cases. This is because, when welfare reform is implemented, recipient cases are already on welfare.

2. State Approaches

The five evaluations studied differed in how much they examined entry effects. Two evaluations have devoted substantial attention to this issue. In Wisconsin's WNW evaluation, entry effects are being estimated using aggregate and disaggregate time-series modeling of application behavior before and after the implementation of welfare reform. Early evidence from process analyses suggests that entry effects may be responsible for a large portion of the caseload changes arising from the WNW package; these findings will need to be confirmed through the time-series analyses described above.

In the APDP/WPDP evaluation, entry effects are being estimated using both administrative data and data from the Current Population Survey (CPS).(3) A time-series model of the fraction of "at-risk" women starting a welfare spell is being estimated using data from the early 1970s to the early 1990s, combining CPS data on the number of women at risk of becoming welfare dependent with monthly caseload data on the number of new welfare spells. To investigate exit effects, another time-series model is being estimated of terminations from AFDC, with separate analyses for the AFDC-UP caseload. Approximately 240 observations are being analyzed. The models control for benefit levels, birth rates, real wages, minimum-wage changes, unemployment rates, and key milestones in welfare policy (such as the OBRA changes of the early 1980s, which substantially reduced earnings disregards). In general, policy changes such as those adopted under OBRA were associated with large entry effects. Using this model, caseloads for the period following the implementation of welfare reform are being forecast and compared with actual caseloads.

In the other states, efforts to determine entry effects were more modest or nonexistent. In Michigan's TSMF evaluation, the evaluator proposed asking questions in the client survey about possible entry effects, but no time-series analyses of applications or terminations were planned. In Minnesota's MFIP evaluation, the importance of entry effects was recognized, but no attempt was made to estimate them: "It was decided that time-series analyses will yield little reliable data since none of the [demonstration] counties are saturating their caseload with MFIP."(4) In Colorado's evaluation, no efforts are being made to estimate entry effects.

None of the evaluations we reviewed are using analysis of exit rates of recipient cases to infer the direction of possible entry effects arising from welfare reform provisions.

3. Analysis and Recommendations

The presence of large entry effects induced by welfare reform can call into question the validity of impact estimates for applicant cases, even if the evaluation features an experimental design. Only two of the five evaluations we reviewed included substantial efforts to study entry effects by analyzing application and termination behavior over time. If longitudinal data on applications are not available, time-series analyses are not feasible. If data are available, there may not be sufficient resources for the analysis in an evaluation largely focused on experimental impact estimates. Even if adequate longitudinal data and analytic resources are available for a particular state, the results of the estimation of entry effects may be sensitive to statistical assumptions employed by evaluators.

Remarkably little research exists on entry effects. Therefore, we recommend additional research on entry effects, which may be separate from random-assignment evaluations of state welfare reform initiatives, since the data collection and analytic needs for each type of study differ. Evaluations of entry effects could look at monthly welfare applications and terminations across several states, using standardized statistical methodologies and data sources such as historical caseload records from the states, the federal Integrated Quality Control System, or the Survey of Income and Program Participation. A major goal of studies using data from several states should be to identify the sorts of policy changes that are most likely to be associated with large entry effects over time. Another goal of such research should be to identify ways to combine nonexperimental entry effect estimates with experimental impact estimates to assess the overall consequences of a state's welfare reform program for applicant cases.

C. TREATMENT OF CROSSOVER CASES

Even under the best circumstances, a fraction of research cases in an experimental welfare evaluation probably will have their original experimental/control status contaminated (as discussed in Chapter IV). Such contamination can arise for several reasons:

  1. A case may migrate between research and nonresearch sites.
  2. A case may merge with, or split from, a case subject to a different set of policies.
  3. Administrative error or manipulation may expose a case to the wrong set of policies.

These cases are commonly called crossover cases, since, in each instance, cases "cross over" from one set of policies to another. Some degree of crossover almost always occurs in a random-assignment evaluation.

Some terms may be helpful in discussing the implications of crossover. Migrant crossover cases are cases that experience a change in experimental/control status as a result of migration to a nonresearch site; merge/split crossover cases are cases that experience a change in experimental/control status as a result of a case merger or split; and administrative crossover cases are cases that experience a change in experimental/control status as a result of administrative error or manipulation. Crossover-type cases are research cases that would be inclined to migrate, merge, split, or otherwise change experimental/control status under at least one of the two sets of policies (experimental and control). Crossover-type cases include actual crossover cases and cases that would have crossed over had they been assigned to the other experimental/control group.

1. Issues

This section considers two analytic issues related to crossover cases:

  1. Should crossover cases be included or excluded from the analysis sample?
  2. Depending on how crossover cases are treated for analysis, should statistical corrections be employed to adjust for a possible bias to impact estimates?

a. Implications of Including or Excluding Crossover Cases from the Analysis Sample

Unless crossover cases leave the state, administrative records on these cases usually will be available even after they migrate to a nonresearch site, split from another case, or merge with another case. Specifically, as long as welfare participation and earnings information are stored in statewide data systems, it will be possible to determine if a research case is participating in welfare or has (UI-covered) earnings. Consequently, researchers will be able to include most crossover cases in the sample used to generate impact estimates.

If crossover cases are included in the analysis sample, the difference in mean outcomes between original experimental cases and original control cases will tend to understate the impact of welfare reform. This dilution results because some cases will have received a mixture of experimental and control group policies. The extent of bias in the impact estimates will depend on the extent of crossover, as well as on the manner in which the impact of welfare reform on crossover-type cases differs from the impact of welfare reform on noncrossover-type cases.

If all crossover-type cases could be identified, then these cases could be excluded from the analysis sample, and impacts could be estimated for noncrossover-type cases only. While these impact estimates would not be representative of impacts of welfare reform on crossover-type cases, they would be unbiased estimates of the impacts of welfare reform on noncrossover-type cases.

In practice, however, crossover-type cases cannot be identified perfectly, since it is uncertain which of the cases in each experimental/control group would cross over if they were subject to the other set of policies. A common practice is to exclude from the analysis sample cases that migrate to nonresearch sites, regardless of their experimental/control status. As long as migration does not depend on experimental/control status, this exclusion will eliminate from the sample both migrant crossover cases and the corresponding group of crossover-type cases. Similarly, to correct for merge/split crossover, all cases that merge or split could be deleted from the sample (although, in practice, it is often difficult to identify all merging or splitting cases within state administrative files).

If crossover behavior does depend on a case's experimental/control status, then excluding crossover cases from the analysis sample will lead to biased impact estimates. The size and direction of the resulting bias will depend on the incidence of crossover and the relationship between experimental/control status and crossover behavior. Biased impact estimates also will result from excluding crossover cases if crossover behavior is correlated with unobserved factors (such as motivation) that affect outcomes. It is not clear whether the bias in impact estimates from excluding crossover cases from the sample exceeds the bias from including these cases in the sample.

b. Statistical Corrections for Crossover

We consider corrections in two situations: (1) when crossover cases are included in the analysis sample, and (2) when crossover cases are excluded from the analysis sample.

Corrections When Crossover Cases Are Included. When crossover cases are included in the analysis sample, impact estimates (obtained as the difference of means between original experimental cases and original control cases) will tend to be diluted, because some original control cases will have been exposed to experimental policies. A proposed correction for this dilution is the Bloom correction (Bloom 1984; Bloom et al. 1993). In its simplest form, this procedure involves dividing the uncorrected impact estimate (the difference in mean outcomes for the experimental and control groups) by one minus the sum of the crossover rates for experimental and control cases. For example, if the crossover rate is 0.05 for experimental cases and 0.15 for control cases, the Bloom correction would involve dividing impact estimates by 0.80. The crossover rate may be measured in at least four ways for experimental and control cases:

  1. As the fraction of experimental cases currently subject to control group policies and as the fraction of control cases currently subject to welfare reform policies. This measure is most appropriate when prior exposure to the other set of policies has little or no effect on current outcomes.
  2. As zero for experimental cases and as the fraction of control cases ever subject to welfare reform policies. This measure is most appropriate when any exposure to welfare reform policies is equivalent to continual exposure to welfare reform policies.
  3. As the fraction of experimental cases ever subject to control group policies and as zero for control cases. This measure is most appropriate when any exposure to control group policies is equivalent to continual exposure to control group policies.
  4. As the fraction of time experimental cases have been subject to control group policies and as the fraction of time control cases have been subject to welfare reform policies. This measure is most appropriate when the impact of welfare reform policies depends on the percentage of time cases are exposed to these policies.

The larger the crossover rate, the larger the difference between the corrected and uncorrected impact estimates.
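A small numerical sketch of the correction, using the crossover rates from the example above:

```python
# The Bloom correction in its simplest form: divide the uncorrected
# difference in means by one minus the sum of the crossover rates.
uncorrected_impact = 4.0   # hypothetical experimental-control difference
crossover_exp = 0.05       # experimental cases exposed to control policies
crossover_ctl = 0.15       # control cases exposed to reform policies

bloom_corrected = uncorrected_impact / (1 - (crossover_exp + crossover_ctl))
print(bloom_corrected)     # 4.0 / 0.80 = 5.0
```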

A major advantage of the Bloom correction is that it can be calculated in a straightforward manner if original experimental/control status is known and if actual crossover behavior is measured accurately. For the Bloom correction to be used, it is not necessary to know whether non-crossover cases are crossover-type cases (that is, whether noncrossover cases would have been crossover cases if they had been assigned to the other experimental/control group).

A major disadvantage of the Bloom correction is that it relies on a restrictive assumption about the similarity of crossover-type cases and noncrossover-type cases. The Bloom correction assumes that, if there were no opportunity for crossover to occur, the impacts of welfare reform would not differ for crossover-type cases and noncrossover-type cases, after controlling for observed characteristics. If impacts would differ for crossover-type and noncrossover-type cases, the Bloom-corrected impact estimate will be biased. It is possible that the impact of welfare reform on crossover-type cases would be larger than the impact of welfare reform on noncrossover-type cases. In these instances, the Bloom-corrected impact estimate will understate the true impact of welfare reform, although not as much as the uncorrected impact estimate. If the impact of welfare reform on crossover-type cases would be smaller than the impact of welfare reform on noncrossover-type cases, the Bloom-corrected impact estimate will overstate the true impact of welfare reform, and the true impact will lie somewhere between the Bloom-corrected estimate and the uncorrected estimate.

Another disadvantage of using the Bloom correction is that the underlying statistical procedure tends to reduce the precision of impact estimates relative to estimates obtained using an indicator for original experimental/control status. In certain situations, this loss of precision may be substantial.

Corrections When Crossover Cases Are Excluded. When crossover and presumed crossover-type cases (all cases that migrate, merge, or split) are excluded from the research sample, two problems arise that statistical corrections may help address. The first is that the exclusion of cases from the analysis sample may introduce sample selection bias, either because of differential crossover between experimental and control cases or because crossover is itself correlated with unmeasured determinants of outcomes. The second problem is that, even if impacts estimated for the restricted sample were unbiased, the restricted sample may not resemble the total sample of research cases, and the impacts estimated for this sample may not be the same as the impacts for the full sample.

If exclusion of crossover cases and presumed crossover-type cases introduces sample selection bias, then sample selection correction procedures, such as the Heckman correction, may be employed. Proper use of these procedures requires that variables exist that influence crossover behavior but not the outcomes of interest. Such variables may be difficult to identify, since anything influencing the decision to migrate, merge, or split may also influence program participation decisions, employment, and earnings. As with the Bloom correction, sample selection corrections generally reduce the precision of impact estimates.
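The following is a minimal sketch of the two-step version of the Heckman correction on simulated data; the exclusion variable ("distance," assumed to influence whether a case remains in the analysis sample but not its earnings) is purely hypothetical:

```python
# Heckman two-step sketch: probit for sample retention, then OLS with
# the inverse Mills ratio added as a regressor. All data simulated.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
distance = rng.normal(0, 1, n)           # hypothetical exclusion variable
treated = rng.binomial(1, 0.5, n).astype(float)
u = rng.normal(0, 1, n)                  # unobserved factor (e.g., motivation)

# A case stays in the sample (does not cross over) when a latent index
# is positive; selection is correlated with the outcome error u.
stay = (0.5 + 1.0 * distance + 0.5 * u + rng.normal(0, 1, n)) > 0
earnings = 100 + 10 * treated + 20 * u + rng.normal(0, 5, n)

# Step 1: probit of retention on the exclusion variable and treatment.
Z = sm.add_constant(np.column_stack([distance, treated]))
probit = sm.Probit(stay.astype(float), Z).fit(disp=0)
xb = Z @ probit.params
mills = norm.pdf(xb) / norm.cdf(xb)

# Step 2: OLS on the retained sample, with the inverse Mills ratio.
X = sm.add_constant(np.column_stack([treated, mills]))
print(sm.OLS(earnings[stay], X[stay]).fit().params[1])  # corrected impact
```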

Even if we could assume that exclusion of crossover and crossover-type cases from the analysis sample did not generate bias in estimating the impacts on cases that remain, the resulting analysis sample may not be representative of the original research sample. To narrow the differences between these two samples, reweighting schemes may be employed to make the two populations more similar. Unfortunately, any reweighting scheme can make the populations resemble each other across only a limited number of observed dimensions. Even after reweighting the analysis sample, differences between the analysis sample and the entire research sample are likely to remain (for example, in the degree of mobility of the cases in each sample).
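One common reweighting approach is to weight analysis-sample cases by the ratio of cell shares in the full research sample to cell shares in the analysis sample, within cells defined by observed baseline characteristics. A minimal sketch, assuming pandas data frames and hypothetical column names:

```python
# Cell-based reweighting sketch: weight = (cell share in full research
# sample) / (cell share in restricted analysis sample).
import pandas as pd

def cell_weights(full: pd.DataFrame, analysis: pd.DataFrame,
                 cells: list) -> pd.Series:
    full_share = full.groupby(cells).size() / len(full)
    analysis_share = analysis.groupby(cells).size() / len(analysis)
    ratio = (full_share / analysis_share).rename("weight")
    return analysis.join(ratio, on=cells)["weight"]

# Hypothetical usage, with cells similar to those described for the
# Michigan evaluation below:
# w = cell_weights(research_sample, analysis_sample,
#                  ["site", "applicant_status", "application_year"])
```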

2. State Approaches

The welfare reform waiver evaluations we studied had different approaches to including crossover cases in the analysis sample.

Wisconsin's WNW evaluation employed a comparison group design; the only crossover that could arise would involve migration out of a demonstration county. The demonstration counties chosen for the evaluation were small, so crossover to nonresearch counties (that is, counties without provisions such as time limits) was a major concern to the evaluator. The risk of crossover to the comparison counties was reduced somewhat by selecting noncontiguous counties for the evaluation, but crossover to other counties with pre-reform policies remained a concern. Efforts were being made to keep track of cases that had left a particular demonstration county for another location in the state.

For the California evaluations, the evaluator is attempting to track crossover cases throughout the state. AFDC and Medicaid participation information is available for the entire state, but AFDC benefit information is available for cases in the research counties only, which limits the ability to include crossover cases in certain impact analyses.

The Colorado evaluation excluded migrant crossover cases from the analysis, but only if they continued to receive welfare in nonresearch counties. If rates of migrant crossover or subsequent welfare receipt differed for experimental and control cases, then this practice would lead to biases in impact estimates. Merge/split crossover cases were always excluded from the Colorado analysis sample. To the extent that merge/split crossover rates differed for experimental and control cases, this would also lead to biases in impact estimates (although the incidence of merge/split crossover is typically small).

Michigan's evaluation deleted crossover cases and any other cases that left the research sites from the research sample. After processing data corresponding to the three years following the implementation of welfare reform, about one-fifth of the total research sample had been deleted for these reasons (in approximately equal percentages for experimental and control cases). To make the resulting sample more representative of the original research sample, a system of weights was developed that controlled for research site, recipient/applicant status, year of application, and number of adults in the case at baseline. All impact estimates in the fourth annual report (including third-year impacts for recipient cases) were estimated using these weights.

In Minnesota's evaluation, crossover cases were included in the analysis sample. Following standard MDRC procedures, impact estimates were generated using original experimental/control status, without employing the Bloom procedure or other corrections for crossover. The MDRC approach provides a precise lower-bound estimate of the impact of welfare reform on cases in the research sample. No analyses were reported showing the sensitivity of impact estimates to use of the Bloom correction or exclusion of actual crossover cases from the research sample.

3. Analysis and Recommendations

Welfare reform evaluators currently have several ways to approach crossover. For example, MDRC researchers generally include crossover cases in the sample and estimate impacts using an original experimental/control status variable, without any accompanying sensitivity analyses. Abt Associates, Inc., in its work on the Michigan evaluation, excluded all migrants and other crossover cases from the research sample and employed a weighting scheme to make the reduced sample representative of the original research sample.

We recommend that evaluators studying the impacts of a welfare reform initiative include crossover cases in the analysis sample whenever the available outcome data permit. This approach minimizes sample selection bias and avoids the need for reweighting. Including crossover cases in the research sample also makes it unnecessary for the evaluator to identify presumed crossover-type cases for systematic deletion from the sample. By not deleting crossover cases, original experimental cases remain comparable with original control cases. Because cases that subsequently cross over remain in the research sample, impact estimates over time will apply to the same sample of cases.

In addition to including crossover cases in the analysis sample, it is important that evaluators identify the extent to which crossover behavior occurs for the experimental and control groups. As noted earlier, there are at least four ways of measuring crossover. The preferred measure will depend on the nature of the intervention being evaluated. Evaluators need to be clear about how they define crossover; they may also want to consider the sensitivity of crossover rate estimates to the way in which crossover is defined.

As long as the incidence of crossover is low, any sort of statistical correction should generate impact estimates similar to the uncorrected impact estimates. However, if the incidence of crossover is high, it is important that welfare reform evaluations provide sufficient information to compare results obtained using different approaches and methodologies.

When generating estimates of the impacts of welfare reform, we recommend that the primary focus be on impacts estimated using the original experimental/control status of research cases (with regression adjustments applied to increase the precision of these estimates). This approach always provides a lower bound on the true impact. In contrast, the Bloom procedure is less precise and provides downwardly biased estimates if impacts are greater for crossover-type cases than for noncrossover-type cases, and upwardly biased estimates if impacts are greater for noncrossover-type cases than for crossover-type cases. Knowing the Bloom-corrected impact estimate for each outcome may still be valuable for sensitivity analyses, so we recommend that these estimates be included in appendixes to the impact study reports.

D. ESTIMATING IMPACTS FOR SUBGROUPS DEFINED BY EVENTS SINCE RANDOM ASSIGNMENT

No matter how carefully an evaluator assembles administrative or survey data for an impact analysis, outcome or background data will be missing or invalid for at least a small fraction of the sample. Frequently, however, outcome or background data are missing or invalid for a large portion of the research sample (perhaps 20 percent or more). One reason for this situation is that the data were not collected completely by administrators or survey workers, although the corresponding information is (in theory) available for all relevant cases. Another reason for missing or invalid data is that the outcome itself is defined by program-related events subsequent to random assignment (such as participation in welfare or in job training), so no outcome data can possibly be collected for some cases. Both situations present problems for the analysis of impacts from the welfare reform package.

1. Issues

This section discusses two issues related to the use of samples defined by events since random assignment:

  1. Should impacts be estimated when administrative or survey data are incomplete or clearly incorrect for a large fraction (one-fifth or more) of the sample?
  2. Should impacts also be estimated for subgroups defined by employment or program participation decisions since random assignment? If so, how?

a. Estimating Impacts When Data Are Incomplete or Incorrect

When considering whether to omit observations from the sample because of incomplete or clearly incorrect administrative and survey data, it is important to distinguish (1) background information, and (2) outcomes data.

Missing or Invalid Background Information. Background information may be omitted for certain cases because of omitted or incorrect values in administrative records or because of survey nonresponse, item nonresponse, or invalid responses to baseline surveys.

The presence of certain background information is essential for an observation to be included in the analysis sample. Welfare reform evaluations generally distinguish impacts for experimental and control cases in recipient and applicant samples, so the presence of an original experimental/control status variable and an applicant/recipient variable is essential for every observation. To construct certain outcome variables (such as employment and earnings) a valid SSN is usually required for matching with UI wage records.

In other instances, observations may be included in impact analyses even if background information is incomplete. For instance, information on the demographic characteristics of a household is valuable for the construction of descriptive statistics and for increasing the precision of impact estimates. However, such information is not essential for impact estimates. Regression-adjusted means can still be calculated by imputing the missing background information, or by setting the missing information equal to a default value and adding indicators for the missing values, without introducing bias. Excluding a large number of cases with missing background information risks making the analysis sample less representative of the entire research sample, since particular types of recipient or applicant cases might be less likely to provide valid background information.
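A minimal sketch of the default-value-plus-indicator approach, assuming a pandas data frame with a hypothetical "age" covariate:

```python
# Fill missing background values with a default and add an indicator
# flagging which values were imputed; both columns then enter the
# impact regression, so no cases are dropped.
import pandas as pd

def add_missing_indicator(df: pd.DataFrame, col: str,
                          default: float = 0.0) -> pd.DataFrame:
    out = df.copy()
    out[col + "_missing"] = out[col].isna().astype(int)
    out[col] = out[col].fillna(default)
    return out

baseline = pd.DataFrame({"age": [25.0, None, 31.0]})
baseline = add_missing_indicator(baseline, "age")
print(baseline)
```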

Missing or Invalid Outcomes Data. Outcome information may be omitted for certain cases because of omitted or incorrect values in administrative records or because of survey nonresponse, item nonresponse, or invalid responses to client surveys.

The presence of outcome information is usually essential for obtaining impact estimates. Imputing missing values of nonessential background variables is unlikely to bias impact estimates. Imputing values of the outcome variables themselves is more questionable, however, since it assumes that the relationship between background information and outcomes is the same for cases with missing information as for cases with nonmissing information.

If observations with missing outcomes data are excluded from impact analyses, then biased impact estimates may result if the incidence of missing outcomes data differs for experimental and control cases, or if observations with missing outcomes data differ from other observations in some systematic way correlated with the outcome variables. In these situations, use of a sample selection procedure may be possible, provided that at least some background information is available for the cases with missing outcome information and that a background variable can be identified that is correlated with the absence of outcomes data but is not correlated with the outcomes themselves. Assuming such a variable can be identified (which is not certain), correcting for a possible sample selection bias in impact estimates must still be balanced against the loss of precision in impact estimates that such corrections entail.

b. Estimating Impacts for Subgroups Defined by Decisions Since Random Assignment

Even if administrative records and survey data are complete, certain outcomes will be available only for a subset of the research sample that is defined by behavior or events since random assignment. For example, when estimating welfare recidivism rates, the sample must be limited to cases that left welfare during the follow-up period. This sample is only a part of the entire sample and, if experimental policies induce a different pattern of exits from welfare, the baseline characteristics of experimental and control cases in the subsample will differ. Another example of an analysis using a subsample defined by behavior after random assignment would be an analysis of JOBS participation rates among current welfare recipients.

A random-assignment design may be helpful in dealing with certain problems related to the use of such samples. In particular, experimental status variables may be useful in correcting for sample selection because of decisions since random assignment. For example, if a study was seeking to estimate labor market outcomes for JOBS participants, experimental status could be used in a sample selection procedure as a predictor of whether particular individuals would participate in JOBS.(5)
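A minimal sketch of the approach suggested in footnote 5, on simulated data: randomly assigned experimental status serves as the instrument for JOBS participation in a two-stage least squares procedure (standard errors from this manual two-stage version would need further adjustment):

```python
# 2SLS sketch: experimental status (randomly assigned, hence
# uncorrelated with unobserved motivation) instruments for JOBS
# participation. All data and effect sizes are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
experimental = rng.binomial(1, 0.5, n).astype(float)
motivation = rng.normal(0, 1, n)          # unobserved confounder

participate = ((0.5 * experimental + motivation
                + rng.normal(0, 1, n)) > 0.5).astype(float)
earnings = 200 + 50 * participate + 40 * motivation + rng.normal(0, 20, n)

# Stage 1: predict participation from experimental status.
stage1 = sm.OLS(participate, sm.add_constant(experimental)).fit()

# Stage 2: regress earnings on predicted (not actual) participation.
stage2 = sm.OLS(earnings, sm.add_constant(stage1.fittedvalues)).fit()
print(stage2.params[1])   # consistent estimate of the participation effect
```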

2. State Approaches

In general, the five evaluations we reviewed handled instances of missing data in the same manner: observations with missing outcomes or background data were not included in the impact analyses. The loss of observations from missing data appears to have been small in most cases. For example, Minnesota's evaluation reported that only one percent of the research sample failed to complete a background information form, and less than five percent of the research sample was excluded from the six-month impact analyses because of missing welfare participation data. The Michigan and Minnesota evaluations proposed using sample selection procedures to adjust impact estimates in instances in which a large portion of the sample contained missing data; in practice, however, these procedures were not employed because the portions of the sample with missing data were small.

For the four random-assignment evaluations, subgroups usually were defined on the basis of baseline characteristics rather than events since random assignment. In certain instances, however, deletions from the sample may have reduced the strict comparability of the experimental and control groups. In Michigan's evaluation, for example, cases active for only one month were deleted from the research sample, reducing the size of the sample by between two and three percent after two years of data had been processed. The evaluator justified this decision because "a one-month eligibility period for AFDC or SFA is somewhat unusual and therefore suspect . . . we assumed that cases active only one month would have left with or without exposure to TSMF and should not be considered part of the demonstration."(6) As noted earlier, denied applicants for both AFDC and SFA were excluded from the Michigan sample because of data limitations; the evaluator argued that there was no evidence that these deletions introduced "important intergroup differences in baseline characteristics" of experimental and control cases.(7)

In one instance, an evaluator reported outcomes for subgroups defined by events since random assignment, but with an important qualification. In its report on six-month impacts from Minnesota's MFIP initiative, MDRC reported outcomes such as welfare benefits of active cases and earnings of employed single parents on welfare. Mean values were distinguished for experimental and control cases, but without any tests of statistical significance of experimental-control differences. A comment in the text noted that "the subset of the MFIP group for whom averages are shown may not be comparable to the subset of the AFDC group for whom averages are shown. Thus, effects of MFIP cannot be inferred from this table."(8)

In Wisconsin, certain outcomes are being collected only for cases that leave AFDC, but the evaluator has not yet decided on procedures for correcting for selection into this sample. Because the WNW evaluation design is nonexperimental, the evaluator is giving more attention to collecting data that could be useful in modeling participation in particular welfare reform programs.

3. Analysis and Recommendations

Properly implemented, random assignment ensures that the baseline characteristics of experimental and control cases are, on average, the same. A major advantage of this equivalence at baseline is that subsequent differences between experimental and control groups can be attributed entirely to exposure to welfare reform policies. When the sample used to analyze impacts from welfare reform is reduced in size, either because of incomplete or incorrect data or because of the analysis of subgroups defined by program-related events since random assignment, the strict comparability of the experimental and control samples may be lost.

The problem of incomplete or incorrect data can be reduced through state efforts to ensure the quality of administrative records and through evaluator efforts to increase survey response rates. We also recommend that evaluators use all observations for which valid outcome data and basic baseline characteristics are present, rather than restricting the sample because nonessential baseline information is missing. If desired, evaluators may impute missing values of nonessential baseline characteristics. By not deleting observations needlessly, both large sample sizes and the representativeness of the overall research sample are preserved. This makes the resulting impact estimates more applicable to the entire population of cases from which the research sample was drawn.

In defining outcomes for inclusion in the impact study, evaluators of state welfare reform programs should adopt analytic strategies that take maximum advantage of the strengths of an experimental design. In particular, we recommend that evaluators define outcomes in ways that enable values to be assigned for all or nearly all recipient and applicant cases in the research sample. When there is interest in a particular outcome for a subgroup defined by events since random assignment (such as recidivism rates for cases that have left welfare), we recommend that alternative outcomes be considered for analysis (such as the number of welfare spells or the percentage of months spent on welfare since random assignment).
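For example, an outcome such as the percentage of months on welfare since random assignment can be computed for every case. A minimal sketch, assuming a monthly participation panel with hypothetical column names:

```python
# Share of follow-up months on welfare, defined for all cases
# (unlike recidivism rates, which exist only for cases that exited).
import pandas as pd

panel = pd.DataFrame({
    "case_id":    [1, 1, 1, 2, 2, 2],   # one row per case per month
    "on_welfare": [1, 1, 0, 1, 0, 0],
})

pct_months = panel.groupby("case_id")["on_welfare"].mean() * 100
print(pct_months)
```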

Notes

(1)The use of sample selection correction procedures such as the Heckman correction (Heckman 1979) can account for cases' participation decisions. Estimates obtained using these procedures may be sensitive to underlying statistical assumptions, however, and usually are less precise than ordinary least squares estimates.

(2)Sometimes, however, there will be no such correspondence between entry and exit effects, since certain policies (for example, diversion payments or some AFDC-UP expansions) will not apply to current welfare recipients but only to new applicants.

(3)The study of entry effects is separate from the rest of the evaluation. It is being conducted by Professor Michael Wiseman at the University of Wisconsin.

(4)"Manpower Demonstration Research Corporation (1994). Proposal Design and Workplan for Evaluating the Minnesota Family Investment Program. New York: MDRC, p. 28.

(5)If the study also wanted to look at the direct effect of JOBS participation on labor market outcomes, experimental status could be used in a two-stage least squares procedure to predict JOBS participation.

(6)"Werner, Alan, and Robert Kornfeld (1996). "The Evaluation of To Strengthen Michigan Families: Fourth Annual Report: Third Year Impacts." Cambridge, MA: Abt Associates, Inc., p. B-1.

(7)Werner and Kornfeld (1996), p. A-4.

(8)"Knox, Virginia, et al. (1995). "MFIP: An Early Report on Minnesota's Approach to Welfare Reform." New York: MDRC, pp. 4-9.

