
4. Extrapolating to the Full Data

The previous chapter analyzed the congruence of responses among those who provided an SSN (what we referred to as the matched sample) and met a set of sample inclusion criteria. However, as Table 3.1 notes, many people do not provide an SSN. Furthermore, nonprovision of an SSN is differential: people who are more disadvantaged are less likely to provide an SSN, although people on welfare are slightly more likely to provide one.

The fundamental problem that is addressed in this chapter is that we do not have SSNs for about half of the sample. We do not want to assume that the responses in the unmatched sample are perfect. Instead, we want to use information from the reporting errors in the matched sample (where we have the MEDS information, treated as truth) to perform better imputations of program enrollment in the unmatched sample. The basic idea is that individuals with characteristics associated with under-reporting in the matched sample are more likely to under-report in the unmatched sample. We estimate a logistic regression model of such reporting errors (both under-reporting and over-reporting) on the matched sample. We then use that model to multiply impute a true response in the unmatched sample; where by multiple imputation we mean that we assign a probability of each response to each individual based on the regression model.
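As a schematic illustration of this two-step procedure (fit an error model on the matched sample, then carry its predicted probabilities to the unmatched sample), consider the following sketch. The coefficients and covariate values are purely hypothetical, not estimates from this report:

```python
import math

def sigmoid(z):
    """Logistic function: maps a logit index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logit coefficients for the probability of a false negative
# (a "No" answer from someone truly enrolled), as would be estimated on
# the matched sample: an intercept plus one illustrative covariate.
beta0, beta1 = -1.5, 0.8

# Illustrative covariate values for five unmatched "No" respondents.
x_unmatched = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Predicted probability that the true status is "Yes" despite a "No" answer.
p_false_negative = [sigmoid(beta0 + beta1 * x) for x in x_unmatched]

# Multiple imputation assigns each respondent a probability of each
# response rather than a single hard value.
assert all(0.0 < p < 1.0 for p in p_false_negative)
```

In this sketch the predicted probabilities rise with the covariate, mirroring the idea that characteristics associated with under-reporting in the matched sample carry over to the unmatched sample.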

In practice, we have one more piece of information. We can estimate the total number of people in the unmatched sample who are enrolled in a program. To do so, we take the total estimates from the administrative data and subtract the estimates of enrollment in the matched sample (i.e., we use the CPS weights and the MEDS/administrative data information). Our logistic regression models in general under-predict the number of program enrollees in the matched sample. We therefore append a multiplicative adjustment factor. The effect of that adjustment factor is to force the imputed number of program enrollees to exactly match the administrative totals.

The balance of this chapter provides a precise mathematical discussion of the problem and our approach. The discussion in this chapter is formal and technical. Many readers will want to skip to the next chapter where we provide the substantive results.

The Identification Problem

We can conceptualize the CPS matching problem as a table including eight “cells,” in terms of total weighted counts. The columns distinguish whether the individual is on the program according to the MEDS (i.e., YES/NO). The rows distinguish both the CPS response and whether the record has an SSN (so it is potentially matchable). The letters name the cells to ease the discussion below.

                          MEDS
                     YES        NO         Total
  CPS  SSN      YES  A = TPS    B = FPS    C = YS
       present  NO   D = FNS    E = TNS    F = NS
       SSN      YES  G = TPA    H = FPA    I = YA
       absent   NO   J = FNA    K = TNA    L = NA
  Total              M = YM     N = NM     O = T



Thus, the subscripts are:

  • “S”—SSN present (i.e., a match was in principle possible; in practice, we drop the bad matches as well);
  • “A”—SSN absent (i.e., a match is not possible);
  • “M”—MEDS.

And the other codes are:

  • TP—true positive;
  • TN—true negative;
  • FN—false negative;
  • FP—false positive.

And finally:

  • Y—“Yes” (on Medi-Cal/welfare);
  • N—“No” (not on Medi-Cal/welfare);
  • T—Total.

We treat the MEDS data as “truth.” Thus, our goal is to use the MEDS data to “fix” the CPS data. From the records that provided SSNs, we know TPS, FPS, FNS, and TNS. So, we simply adjust the CPS answers to align with the MEDS answers.

The challenge, therefore, is the unmatchable data—those records for which no SSN was available in the CPS.18 For those records, we know only the row totals—YA and NA—and the column totals by subtraction from the “S” sample—TPA+FNA = YM-TPS-FNS and FPA+TNA = NM-FPS-TNS. However, the matched sample provides some additional—as we will see, not quite enough—information.

To understand our approach, begin by formally defining the imputational false positive rate and the imputational false negative rate as:

(4.1) ρiFP= (FPS)/(TPS+FPS)

ρiFN= (FNS)/(TNS+FNS)

where the “i” is for imputation and these are the rates with respect to the CPS answers (as opposed to the behavioral rates in terms of the true behavior as measured in the MEDS that we also considered in the previous chapter).

These imputational rates are in contrast to the behavioral rates of the previous chapter. The tabulations there addressed the behavioral question. Given the true status, what is the probability of a false response? This is not a useful concept for imputation. In the CPS, we observe the potential false response and want to infer the true status. To do so, we want the imputational rates, i.e., the probability that the true status is different than the observed response, given the observed response. The two sets of rates are exactly related. From a complete 2x2 contingency table (i.e., TP, FP, TN, FN), we can compute both sets of rates. From one set of marginals and one set of rates, we can recover the other set of rates. Which rate is more insightful depends on whether we are addressing behavioral questions (as in the last chapter) or imputational questions (as in this chapter).
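To make the relationship concrete, a complete 2x2 table determines both sets of rates. The sketch below uses made-up cell counts, not the report's data:

```python
# Hypothetical weighted cell counts for a matched 2x2 table.
TP, FP, FN, TN = 100.0, 10.0, 40.0, 850.0

# Imputational rates: condition on the observed CPS response.
rho_i_fp = FP / (TP + FP)  # P(truly off program | reported "Yes")
rho_i_fn = FN / (TN + FN)  # P(truly on program  | reported "No")

# Behavioral rates: condition on the true status from the MEDS.
rho_b_fp = FP / (FP + TN)  # P(reported "Yes" | truly off program)
rho_b_fn = FN / (FN + TP)  # P(reported "No"  | truly on program)

# From one set of marginals and the behavioral rates, the cells -- and
# hence the imputational rates -- can be recovered.
on_true, off_true = TP + FN, FP + TN          # MEDS column totals
assert abs(on_true * rho_b_fn - FN) < 1e-9
assert abs(off_true * rho_b_fp - FP) < 1e-9
```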

Then, if we knew these imputational error rates, we could probabilistically impute the data. We would create two pseudo-observations for each observation (dividing the sample weight between the pseudo-observations). So, for example, if an observation reported “Y” in the CPS, that observation would be assigned a “Y” with probability 1 - ρiFP and "N" with probability ρiFP. Similarly, if an observation reported "N" in the CPS, that observation would be assigned a "N" with probability 1 - ρiFN and a "Y" with probability ρiFN.
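The pseudo-observation scheme just described can be sketched as follows; the weight and rates are illustrative only:

```python
def split_observation(weight, reported_yes, rho_i_fp, rho_i_fn):
    """Split one CPS record into two weighted pseudo-observations,
    one with imputed true status "Y" and one with "N"."""
    if reported_yes:
        p_yes = 1.0 - rho_i_fp   # reported "Y": truly "Y" w.p. 1 - rho_i_fp
    else:
        p_yes = rho_i_fn         # reported "N": truly "Y" w.p. rho_i_fn
    return [("Y", weight * p_yes), ("N", weight * (1.0 - p_yes))]

# A respondent with sample weight 2.0 who reported "Yes".
pseudo = split_observation(weight=2.0, reported_yes=True,
                           rho_i_fp=0.1, rho_i_fn=0.25)

# The two pseudo-observations share the original sample weight.
assert abs(sum(w for _, w in pseudo) - 2.0) < 1e-12
```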

We do not know these rates in the unmatched sample. Furthermore, the rates from the matched data are not directly applicable in the unmatched data. If the rates from the matched sample applied in the unmatched sample, then applying those rates to the unmatched data would recover the actual number of people on Medi-Cal/welfare in the MEDS, i.e.:

(4.2) YM = TPS + FNS + YA(1 - ρiFP) + NAρiFN

However, we have already noted that the under-reporting in the matched sample is not large enough to explain the under-reporting in the full sample.

We have a fundamental non-identification problem: One equation for YM and two unknowns—the rates in the unmatched data. Setting one of the rates fixes the other rate.

Given that on net we have under-reporting of Medi-Cal/welfare, and given that false positives are rare (and relatively stable through time), we adopt the simplest rule: we use the false positive rate from the matched data in the unmatched data. We then adjust the false negative rate (by a multiplicative factor, “α”) until the implied total count of people on Medi-Cal/welfare in the CPS equals the count in the MEDS (assumed to be truth).

(4.3) YM = TPS + FNS + YA(1 - ρiFP) + NAρiFNα

The left-hand side is the “true” number of individuals on Medi-Cal/welfare from the MEDS. The right-hand side is the “fixed” number of individuals on Medi-Cal/welfare in the CPS. Considering each of those terms in turn:

  • TPS + FNS: The number of people who have Medi-Cal/welfare in the matched sample (true positives plus false negatives).

  • YA(1-ρiFP): The number of people who report having Medi-Cal/welfare who actually do. We know the number of people who report having Medi-Cal/welfare in the unmatched sample. We estimate the number of these people who actually do have Medi-Cal/welfare using the imputational false positive rate from the matched sample. This is the identifying assumption.

  • NAρiFNα: The number of people who report not having Medi-Cal/welfare who actually do. Again, we know the number of people who report not having Medi-Cal/welfare in the unmatched sample. The matched imputational false negative rate, scaled by α, gives the probability of a false negative in the unmatched data.

Solving for α, the multiplicative adjustment to the false negative rate in the unmatched sample, yields:

(4.4) α = (YM - TPS - FNS - YA(1 - ρiFP))/(NAρiFN)

Except for the false positive rate, each of the terms on the right side is observable. In the numerator, the first term is the number of people on Medi-Cal/welfare from the MEDS. The second and third terms are the number of people on Medi-Cal/welfare from the MEDS in the matched sample. The fourth term is the product of the number of people in the unmatched sample who claim to have Medi-Cal/welfare and the fraction of them estimated to actually have Medi-Cal/welfare (i.e., one minus the imputational false positive rate). The denominator is the (weighted) number of people in the CPS sample who do not provide an SSN and claim not to have Medi-Cal/welfare, multiplied by the imputational false negative rate. The false positive rate for the unmatched data is not observed; by assumption, we use the value estimated from the matched sample. Since the CPS is a sample, each of these quantities is weighted.19

Imputing the Data

This analysis of identification suggests that we are missing one piece of information. However, once we assume that the false positive rate is common to the matched sample and the unmatched sample, we can solve for α. Then, knowing α is enough to solve for each of the cells:

(4.5) TPA = YA(1-ρiFP)
FPA = YAρiFP
FNA = NAρiFNα
TNA = NA(1-ρiFNα)
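A sketch of the cell computation in Equation 4.5 with illustrative inputs, treating α as a multiplier on the matched false negative rate as the text describes:

```python
def impute_unmatched_cells(y_a, n_a, rho_i_fp, rho_i_fn, alpha):
    """Fill in the four unmatched cells from the observed row totals,
    the matched-sample imputational rates, and the adjustment alpha."""
    tp_a = y_a * (1.0 - rho_i_fp)        # true positives
    fp_a = y_a * rho_i_fp                # false positives
    fn_a = n_a * rho_i_fn * alpha        # false negatives, scaled up
    tn_a = n_a * (1.0 - rho_i_fn * alpha)  # true negatives (remainder)
    return tp_a, fp_a, fn_a, tn_a

# Illustrative values (millions); all inputs are hypothetical.
cells = impute_unmatched_cells(y_a=7.0, n_a=40.0,
                               rho_i_fp=0.13, rho_i_fn=0.10, alpha=2.4)

# The four cells exhaust the unmatched sample: YA + NA.
assert abs(sum(cells) - 47.0) < 1e-9
```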


Cell counts for the terms in individual years are often too small to allow public release. However, the totals over the whole 11-year period are releasable. To illustrate our methods, Equations 4.6 and 4.7 show the actual numbers for Medi-Cal and welfare, respectively, summing over all 11 years (in millions, rounded to hundreds of thousands).

(4.6) α = (YM - TPS - FNS - YA(1-ρiFP))/(NAρiFN)
   = (31.4 - 10.6 - 5.0 - 6.1)/(4.0) ≈ 2.4


(4.7) α = (YM - TPS - FNS - YA(1-ρiFP))/(NAρiFN)
   = (13.9 - 3.6 - 4.5 - 1.5)/(3.1) ≈ 1.4


Thus, over the full 11 years, the MEDS has 31.4 million adults on Medi-Cal. The weighted matched CPS data have 10.6 million true positives and 5.0 million false negatives. Applying the imputational rates from the matched sample, we estimate 5.0 million false positives among the unmatched “Yes” responses (leaving 6.1 million true positives) and 4.0 million false negatives among the unmatched “No” responses. To align the CPS totals with the MEDS totals, we need to increase that false negative count by a factor of 2.4 (i.e., from 4.0 million to 9.6 million).

For welfare, the MEDS has 13.9 million adults on welfare. The matched CPS data have 3.6 million true positives and 4.5 million false negatives. Applying the imputational rates from the matched sample, we estimate that 1.5 million of the unmatched “Yes” responses are true positives and that there are 3.1 million false negatives among the unmatched “No” responses. To align the CPS totals with the MEDS totals, we need to increase the false negative count by a factor of 1.4 (i.e., from 3.1 million to 4.3 million).
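The arithmetic in Equations 4.6 and 4.7 can be verified directly from the published totals (in millions). The function below treats the fourth numerator term as the predicted true positives among unmatched “Yes” responses and the denominator as the predicted false negatives among unmatched “No” responses:

```python
def adjustment_factor(y_meds, tp_s, fn_s, tp_a_hat, fn_a_hat):
    """Equation 4.4: multiplicative adjustment to the false negative
    rate needed to hit the administrative total.
    tp_a_hat: predicted true positives among unmatched "Yes" responses.
    fn_a_hat: predicted false negatives among unmatched "No" responses."""
    return (y_meds - tp_s - fn_s - tp_a_hat) / fn_a_hat

# 11-year totals in millions, from Equations 4.6 and 4.7.
alpha_medical = adjustment_factor(31.4, 10.6, 5.0, 6.1, 4.0)
alpha_welfare = adjustment_factor(13.9, 3.6, 4.5, 1.5, 3.1)

assert round(alpha_medical, 1) == 2.4
assert round(alpha_welfare, 1) == 1.4
```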

Stratifying and Adjusting for Covariates

The above analysis is applicable when the population is homogeneous. In reality, the population is heterogeneous. We are able to address this to some extent. We have a small number of variables—calendar year (in principle, also gender and age)—that are measured (nearly) consistently in the MEDS and the CPS. For these variables, we can fully stratify (i.e., we compute a different value of α for each stratum).

In addition to the small number of variables that are common to both data sets, we have many other covariates in the CPS. This allows us to estimate the false negative and false positive rates in the matched sample, not only in terms of the small set of common variables, but also in terms of the larger number of variables in the CPS alone. We do so using exactly the imputational regressions discussed in Chapter 3.

Using these multivariate models seems particularly important for two reasons. First, many of these variables are likely to be strongly related to Medi-Cal/welfare eligibility and therefore to true MediCal/welfare coverage, e.g., marital status, presence of children in the household, and household earnings. Second, dual coverage (Medicaid and also other, usually private, health insurance) is an issue of substantive interest. As much as possible, we want to correctly impute in the sub-samples with and without private health insurance.

Suppose that, within stratum s, we can assign each individual his/her own error rates (hence the [j] and [k] indices below); then we can write our equation for α as:

(4.8) YM,s = TPS,s + FNS,s + ∑j∈YA Wj(1-ρiFP[j]) + αs ∑k∈NA Wk ρiFN[k]

αs = (YM,s - TPS,s - FNS,s - ∑j∈YA Wj(1-ρiFP[j])) / (∑k∈NA Wk ρiFN[k])

In practice, we use the predictions of the imputational logit regression models from the previous chapter to estimate the individual rates ρiFP[j] and ρiFN[k].20
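A sketch of the stratum-level computation in Equation 4.8, with hypothetical weights and model-predicted rates:

```python
def stratum_alpha(y_meds_s, tp_s, fn_s, yes_records, no_records):
    """Stratum-specific adjustment factor from Equation 4.8.
    yes_records: (weight, rho_fp) pairs for unmatched "Yes" respondents j.
    no_records:  (weight, rho_fn) pairs for unmatched "No" respondents k."""
    pred_tp = sum(w * (1.0 - r) for w, r in yes_records)  # sum_j Wj(1 - rho_FP[j])
    pred_fn = sum(w * r for w, r in no_records)           # sum_k Wk rho_FN[k]
    return (y_meds_s - tp_s - fn_s - pred_tp) / pred_fn

# Illustrative inputs for one stratum (all values hypothetical).
alpha_s = stratum_alpha(
    y_meds_s=3.0, tp_s=1.0, fn_s=0.5,
    yes_records=[(0.4, 0.1), (0.6, 0.2)],
    no_records=[(1.0, 0.2), (1.5, 0.1)],
)
```

Here α > 1 indicates that the model-predicted false negatives are too few to reconcile the CPS with the administrative total for this stratum.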

For this project, we have the matched data. However, given the imputational logit model, this approach can also be applied to the CPS public use data by those who do not have the matched data. To see this, write:

(4.9) YM,s = ∑j∈YS∪YA Wj(1-ρiFP[j]) + γs ∑k∈NS∪NA Wk ρiFN[k]

γs = (YM,s - ∑j∈YS∪YA Wj(1-ρiFP[j])) / (∑k∈NS∪NA Wk ρiFN[k])


Below, we compute α and γ for each stratum. Thus, an analyst without access to the matched data could also create an imputed data set.

In what follows, we apply this approach directly to our CPS data. We stratify by year. In practice, the samples within demographic sub-groups are too small to yield reliable estimates. Table 4.1 presents the resulting estimates of α. For adults, the adjustment factors are quite large in the early years. For Medi-Cal, we need to triple or even quadruple the false negative rates in the early years, suggesting that the unmatched sample is very different from the matched sample. For welfare, despite the fact that the under-reporting is absolutely more severe, the unmatched sample is closer to the matched sample. The highest adjustment factors are only slightly greater than two.

Over the 11 years covered by our analysis, the adjustment factors shrink. By 2000, the adjustment factor for Medi-Cal is under 1.5 and for welfare, under 1. It is not clear whether these changes over time result from changes in the CPS instrument or from changes in who is receiving welfare. The large drop in 1995 is consistent with the desired effects of the change in the CPS instrument in that year. The drop in 2000 would also be consistent with the changes in the CPS instrument in that year. Unfortunately, the drop seems to date back to 1999, one year too early.

Table 4.1.

Adjustment Factors α

           Adults              Children
Year       Medi-Cal  Welfare   Medi-Cal  Welfare
1990       3.0       1.8       1.7       3.3
1991       4.2       2.0       2.3       4.1
1992       3.7       2.2       2.6       4.0
1993       3.2       1.8       3.1       5.2
1994       3.7       1.3       1.0       3.3
1995       2.3       1.3       2.0       3.4
1996       2.2       1.2       1.7       3.0
1997       2.9       1.5       2.2       3.8
1998       2.2       1.3       2.0       3.7
1999       1.4       1.2       1.3       2.8
2000       1.4       0.7       2.4       4.9
Average    2.4       1.4       2.0       3.7


The preceding discussion applies to adults, for whom we potentially have an SSN. We do not have SSNs for any children. Following Census practice, we impute from parents to children.21 For Medi-Cal this is consistent with Census’s logical imputations. If parents have Medicaid, then children are imputed to have Medicaid. For welfare, this is definitional. Children are never asked about welfare in the CPS. Instead, we impute welfare to both adults and children based on the receipt of public assistance from a welfare program. We then follow the equivalent approach; in other words, we adjust the false negative rate until it aligns the imputed data with the MEDS totals.

We note that the adjustment factors for children are much higher. Furthermore, unlike the adjustment factors for adults, the adjustment factors for children do not fall over time. These adjustment factors are large enough to cast some doubt on the quality of the imputations for children. The adjustments will align the total number of children with the control counts. Our methods impute to children in proportion to the false negative rates. This continues to be a reasonable approach. However, the adjustment factors are so large as to suggest that there is some factor beyond false negatives explaining the under-reporting for children. Whatever it is, the matched data do not identify it.

Conclusion

Given the adjustment factors shown in Table 4.1, we create a multiply-imputed data set. For the matched data, we overwrite the CPS data with the MEDS data. For the unmatched “Yes” responses, we multiply impute based on the false positive rates implied by the logit regression coefficients from the matched sample. For the unmatched “No” responses, we multiply impute based on the product of the false negative rates implied by the logit coefficients from the matched sample and the adjustment factor, α, for that survey year. In practice, we create two data sets, one for the analysis of Medi-Cal (and health insurance) and a second for the analysis of welfare. We do not attempt the full joint imputation of welfare and Medi-Cal-Only.

These multiple-imputation models appear to be the best that can be done with the available data—including the CPS-MEDS match. For those who provide an SSN (our “matched sample”), we use the MEDS data to impute program participation. For those who do not provide an SSN (our “unmatched sample”), we use a multivariate model estimated on those providing an SSN to impute program participation. If the unmatched sample were identical to the matched sample except that they did not provide an SSN, the αs would be close to one. In fact, in most cases, the αs are greater than one, suggesting that people who do not provide an SSN are—even conditional on covariates—more likely to be program participants than those who do provide an SSN.

In the subsequent chapters we use the multiply-imputed data sets to obtain a better understanding of how the mis-reporting in the CPS can affect different types of analyses. Specifically, in Chapter 5 we examine how mis-reporting of Medi-Cal receipt affects estimates of the number of uninsured in California, and in Chapter 6 we look at how mis-reporting of Medi-Cal and welfare participation affects estimates of program take-up.

These substantive estimates in the next two chapters use the results of the model. Some of the αs are so large (well above 2, especially for children) as to suggest that the results should be used with some caution. We nevertheless believe that they represent an improvement over estimates that make no correction for under-reporting, or even estimates that correct for under-reporting without access to matched data.



18 Any errors in coverage of the CPS are likely to have their effects at this stage. Consider the possibility that the CPS systematically and disproportionately misses (i.e., fails to interview) those with welfare/Medicaid and those errors are not corrected by CPS reweighting (for region, urban/rural, gender, age, race-ethnicity). The MEDS and the MEDS counts would include these people. The CPS and the CPS counts would not. Our approach will attribute this to under-reporting. In fact, the problem would be, not that respondents responded falsely, but that people with welfare/Medicaid were not interviewed.

This problem is likely to be most severe among the institutionalized population. They are not included in the CPS sampling frame. They are in our MEDS counts. We partially address this issue by dropping those over 65. It seems likely that a better estimate of the number of people with Medi-Cal in institutions would considerably lower our estimated false negative rates in the unmatched population relative to the false negative rates in the matched population. This ratio is the α we define below. This issue probably explains much of why α is greater than 1.

19 We note that this is the analysis considering the concepts (Medi-Cal, welfare) separately. It would also be of interest to impute jointly welfare and Medi-Cal. Table 3.3 and the probit regressions reported in Table 3.8 provide the inputs for such an analysis.

We do not perform the full imputation here. The actual imputation would be more complicated than the single imputation attempted here. The single imputation considered here is for a 2x2 table, with two error rates. Fixing one of the error rates is enough to allow computation of the other one from the data. In contrast, the joint response problem is a 3x3 table, with six distinct error rates. We need to fix four of them to be able to compute adjustment factors for the last two. By analogy with the approach in the body of the paper, it would be natural to assume that the three upcoding error rates are common. However, that is not sufficient. We still need to fix either P[W|N] or P[W|MO]. We have seen that both of these errors are common and changing over time, so it is not clear how to proceed.

20 The form of the equation in the text is computationally straightforward. One could argue that it would be more consistent with the logit modeling strategy to include α inside the logit index. Doing so would require a non-linear optimization to compute α.

21 See for example the March CPS documentation for 1990 (p. 9-8; http://www.census.gov/apsd/techdoc/cps/cpsmar00.pdf): “After data collection and creation of an initial microdata file, further refinements were made to assign Medicaid coverage to children. In this procedure all children under 21 years old in families were assumed to be covered by Medicaid if either the householder or spouse reported being covered by Medicaid (this procedure was required mainly because the Medicaid coverage question was asked only for persons 15 old and over). All adult AFDC recipients and their children, and SSI recipients living in States which legally require Medicaid coverage of all SSI recipients, were also assigned coverage.”

 
