ACF OPRE: Head Start Impact Study and Follow-up

Appendix 4.1: Imputations for Item Nonresponse in the Fall 2002 Data

To facilitate analysis of the data, and to ensure that the results obtained by different analysts are consistent with one another, it is desirable to impute missing responses to produce as complete a data set as possible. Imputation also helps to control for nonresponse bias and produce a more representative file for analysis. For example, many software packages select only the cases that are complete on the set of variables analyzed and ignore the cases with incomplete data. Discarding incomplete cases is inefficient, but more seriously, the complete cases may not be representative of the target population; consequently, estimates derived from them are subject to nonresponse bias.

For this study, missing values for fall 2002 variables due to item nonresponse were imputed using hot deck imputation. Hot deck imputation is a procedure in which cases with missing values for specific variables have the “holes” in their records filled in with values from other, similar cases. Because the imputed values come from actual respondents’ values, hot deck imputation has the desirable property that imputed values are always realistic and preserve the underlying sampling variation in the data.

The “donor” case from which the imputed value is taken (also referred to as the respondent) is randomly selected from a pool of similar children who are matched to the “recipient” (or nonrespondent) on characteristics that are correlated with the variable being imputed. The aim is to construct pools (or imputation classes) that explain as much of the variance in the variable to be imputed as possible, but are of adequate size so that there is some minimum number of respondents in each class, and donors are not reused too many times. The assumption is that within each imputation class, the mechanism that leads to missing data is “ignorable”; that is, the missing values are as though they were missing at random. This means that the probability that a value is missing can depend on the values of the imputation class variables but, within class, not on the missing outcome values. If implemented carefully, hot deck imputation can preserve the distribution of the data on measured variables so that estimates of distributional characteristics such as percentiles, variability, and correlation will not be distorted. However, if the item response rate is very high, a small percentage of imputed data will have very little effect on the distribution of the variable regardless of the imputation method.

The variables used to form imputation classes or cells were identified from chi-square tests of association and bivariate correlation coefficients. In some cases, they were also determined by skip patterns in the parent questionnaire and other requirements of logical consistency between questionnaire items. The imputation cells were created by cross-tabulating all of these variables at once. A donor was allowed to be used up to three times. When no more donors were available in an imputation class, adjacent cells were collapsed. The order of collapsing was specified so that levels of the least correlated cell variable were collapsed first, followed by the second least correlated variable, and so on, until a donor was found. Imputed values have been flagged so that an analyst has the option of not using the imputed data, such as when analyzing the effects of the imputed data on the results.
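The donor selection, reuse limit, and collapsing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's production code: the function name, the record layout (a list of dictionaries with `None` for missing values), and the fixed random seed are hypothetical. Collapsing is also simplified here. The sketch drops the least correlated class variable entirely, which merges all of its levels at once, whereas the study collapsed adjacent cells.

```python
import random
from collections import defaultdict

MAX_DONOR_USES = 3  # the text allows each donor to be used up to three times


def hot_deck_impute(records, target, class_vars, rng=None):
    """Fill missing `target` values from random donors in the same cell.

    `class_vars` is assumed to be ordered from most to least correlated with
    `target`, so dropping the last one coarsens the cells on the weakest
    predictor first. Returns the indexes of records that were imputed,
    which plays the role of the imputation flag.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    uses = defaultdict(int)        # donor index -> number of times used so far
    flagged = set()

    # Donors are fixed up front, so imputed records never become donors.
    donors = [i for i, r in enumerate(records) if r[target] is not None]
    recipients = [i for i, r in enumerate(records) if r[target] is None]

    for i in recipients:
        keys = list(class_vars)
        while True:
            pool = [
                d for d in donors
                if uses[d] < MAX_DONOR_USES
                and all(records[d][k] == records[i][k] for k in keys)
            ]
            if pool or not keys:
                break
            keys.pop()  # simplified collapse: drop the least correlated variable
        if not pool:
            continue  # no eligible donor anywhere; leave the value missing
        d = rng.choice(pool)
        uses[d] += 1
        records[i][target] = records[d][target]
        flagged.add(i)
    return flagged
```

Keeping the flagged indexes separate from the data lets an analyst rerun an estimate with and without the imputed values, which is the use of the flags that the paragraph above mentions.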

We imputed missing data for all fall 2002 demographic variables and the fall 2002 measures of each of the spring 2003 outcomes (e.g., parenting practices, child health, assessment scores, child socio-emotional behavior, and other scale variables). The variables that underwent imputation and their item nonresponse rates for the analysis sample used in this report (the spring 2003 child assessment respondents) are given in Exhibit A.4.1.1.

The logical relationships between items were taken into account in the imputation to maintain consistency of the data and attempt to preserve correlations among variables. Closely correlated items such as assessment scores or socio-emotional scales were usually imputed from a single donor child. The donor was randomly selected from within a donor pool of children matched by treatment/control group assignment, language spoken at home, sex, race/ethnicity, and age in months as of September 1, 2002. The score and scale variables were imputed in groups according to similar patterns of missingness (i.e., the joint missing rates) and the degree of correlation among them. This strategy was viewed as a compromise between the desire to avoid throwing away reported scores and the goal of preserving the correlation among score variables. In general only the missing scores were imputed on each record, and children with partially reported scores did not have them overwritten by the donor’s scores. However, for patterns of missingness represented by a small number of children, the donor’s scores were allowed to overwrite the reported scores in the interests of reducing the number of computer runs. It should be noted that the percentage of child records with partial reporting of score and scale variables is small. The socio-emotional scales were either entirely missing or entirely reported for all but a trivial (< 0.1%) percentage of the sample. For the depression, locus of control, welfare, and crime and violence scales, 8.3 percent of the sample had partially missing data (5.6 percent were missing all but one scale, 2.5 percent were missing only one scale, and 0.2 percent were missing some other combination). For the continuous score variables, less than 5 percent of the sample had partial reporting of scores; most were either missing all scores or none.
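The general rule above, one donor per group of correlated scores with reported scores left untouched, can be illustrated with a short Python sketch. The function name, the variable names in the usage note, and the record layout are hypothetical, and the pool is assumed to have already been restricted to children matched on the characteristics listed in the text.

```python
import random


def fill_from_single_donor(recipient, pool, score_vars, rng=None):
    """Impute a group of related scores from one donor child.

    Only the recipient's missing scores are filled, so reported scores are
    never overwritten. Returns the names of the variables that were imputed.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    # Require donors who reported every score in the group, so that all
    # imputed values on a record come from the same child.
    complete = [d for d in pool if all(d[v] is not None for v in score_vars)]
    if not complete:
        return []
    donor = rng.choice(complete)
    imputed = [v for v in score_vars if recipient[v] is None]
    for v in imputed:
        recipient[v] = donor[v]
    return imputed
```

For example, a child with a reported `ppvt` score but missing `wj_lw` and `wj_sp` scores would keep the reported value and receive the other two from the same donor. Taking both from one child is what helps preserve the correlation between them.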

The order in which items are imputed is also important in preserving the correlation structure in the data, because some imputed items can be used to form imputation cells in the subsequent imputation of related items. This strategy was used, for example, in the imputation of categorical assessment scores, so that the first score that was imputed could be used to create imputation cells for the next score. It was also used throughout in the imputation of correlated demographic and household variables. Similarly, for items associated with a skip pattern in the parent questionnaire, the item that leads into the skip pattern was imputed first and the subsequent items were imputed depending on the value of the skip indicator. The demographic variables were imputed first, then used to impute parenting practice, household income, child health, assessment score, and scale variables. Items with the least amount of nonresponse within a group of related categorical variables were imputed first, then used in the imputation of items with larger amounts of missing data.
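The ordering rule for related categorical items can be sketched as follows. This is a minimal illustration, not the study's procedure: the function names and record layout are hypothetical, the base variables are assumed to be complete already, and the fallback to all donors when no exact match exists stands in for the cell collapsing described earlier.

```python
import random


def missing_count(records, var):
    """Number of records with no reported value for `var`."""
    return sum(1 for r in records if r[var] is None)


def impute_in_sequence(records, base_vars, related_vars, rng=None):
    """Impute related items from least to most nonresponse.

    Each item, once completed, joins the class variables used to match
    donors for the items imputed after it. Returns the imputation order.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    order = sorted(related_vars, key=lambda v: missing_count(records, v))
    class_vars = list(base_vars)
    for var in order:
        donors = [r for r in records if r[var] is not None]
        if not donors:
            continue  # nothing to borrow from; leave this item missing
        for r in records:
            if r[var] is None:
                pool = [d for d in donors
                        if all(d[k] == r[k] for k in class_vars)]
                # Simplified fallback (an assumption): use any donor if the
                # fully matched pool is empty.
                r[var] = rng.choice(pool or donors)[var]
        class_vars.append(var)  # the completed item now helps define cells
    return order
```

Because the item with the least nonresponse is completed first, fewer of the values used to define later cells are themselves imputed.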

In general, donors were randomly selected from within the same Head Start program within a cell when possible; when necessary, the program was collapsed with a geographically adjacent program in the cell. Programs were sorted within a cell by broad geographic area (our primary sampling unit, or PSU) within Census region, so adjacent programs tended to be from the same county or a nearby county. When there were a large number of imputation cells, the donor search often was broadened to the entire geographic PSU within a cell, and sometimes PSUs within a region were also collapsed. Some items such as fall scores required a closer match on demographic variables than geography or Head Start program in order to find a similar donor pool, and no attempt to stay within the PSU or program was made for these. Geography was also ignored for certain items requiring a very close match to the donor on other questionnaire items for logical consistency.

The distribution of each imputed variable was compared before and after imputation to check that the imputation procedures had not appreciably changed the distribution of the variable. Correlation matrices were examined to check that bivariate correlations among scores and scales were not attenuated. Crosstabs between categorical variables involved in skip patterns and those requiring logical consistency were checked to make sure that inconsistencies had not been introduced. The only variable where the distribution shifted more than a trivial amount was father’s employment status, which had a very high missing rate of nearly 52 percent. The percentage of fathers employed full-time shifted from 74 percent to 71 percent, and the percentage unemployed increased from 16 percent to 20 percent. Fathers for whom employment status is unknown tend to come from cells with higher unemployment rates among respondents; thus, the inclusion of their imputed values will raise the overall unemployment rate. The variables used to create imputation classes for employment status were receipt of food stamps, receipt of TANF, father’s level of education, father’s race, and PSU.
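A before-and-after comparison of this kind for a categorical variable can be sketched as below. The function names are hypothetical and the calculation is unweighted, which is an assumption; the report does not state here whether its comparisons used sampling weights.

```python
from collections import Counter


def category_shares(values):
    """Share of cases in each category, as a fraction of all cases."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def distribution_shift(reported, completed):
    """Change in each category's share after imputed values are added.

    `reported` holds only the respondents' values; `completed` holds the
    reported and imputed values together.
    """
    before = category_shares(reported)
    after = category_shares(completed)
    return {k: after.get(k, 0.0) - before.get(k, 0.0)
            for k in set(before) | set(after)}
```

A positive shift for a category means that imputed cases fall into it more often than reported cases do, which is the pattern the paragraph above describes for unemployed fathers.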

Exhibit A.4.1.1: Item Nonresponse Rates for Imputed Variables
Variable Name	Reported Count	Imputed Count	Percent Imputed	Total of Reported and Imputed Count
Crime & Violence Maximum Likelihood Ability Estimate	3,546	352	9.0%	3,898
Crime & Violence IRT True-Score	3,546	352	9.0%	3,898
Number of children age 17 and under in household	3,796	102	2.6%	3,898
Restricting Child Movement Scale - fall	3,539	359	9.2%	3,898
Family Cultural Enrichment Scale	3,524	374	9.6%	3,898
Family Cultural Enrichment Scale 2	3,540	358	9.2%	3,898
Removing Harmful Objects Subscale - fall	3,538	360	9.2%	3,898
# Times child is read to	3,548	350	9.0%	3,898
Safety Devices Subscale - fall	3,538	360	9.2%	3,898
Parental Safety Practices Scale - fall	3,537	361	9.3%	3,898
Spanked child in last week	3,544	354	9.1%	3,898
# Times spanked child	3,528	370	9.5%	3,898
Used time out in last week	3,542	356	9.1%	3,898
# Times used time out	3,524	374	9.6%	3,898
Adult books in home	3,547	351	9.0%	3,898
Derived caregiver's race	141	7	4.7%	148
Derived child race	3,882	16	0.4%	3,898
Child sex	3,898	0	0.0%	3,898
Derived father's race	3,710	188	4.8%	3,898
Head Start participation	3,897	1	0.0%	3,898
Derived mother's race	3,777	121	3.1%	3,898
Caregiver's age	137	11	7.4%	148
Child born in the United States	3,792	106	2.7%	3,898
Economic difficulty scale	3,525	373	9.6%	3,898
Father's employment status	1,875	2,023	51.9%	3,898
Father's highest educational attainment	3,460	438	11.2%	3,898
Father's marital status	3,421	477	12.2%	3,898
Father's age	3,283	615	15.8%	3,898
Biological father's immigrant status	3,702	196	5.0%	3,898
Biological father a recent immigrant	1,273	90	6.6%	1,363
Biological father lives with child	3,660	238	6.1%	3,898
Biological father years in the United States	1,170	193	14.2%	1,363
Grandparent in the household	3,786	112	2.9%	3,898
Anyone in household with health condition	3,537	361	9.3%	3,898
Homelessness	3,535	363	9.3%	3,898
Primary home language	3,870	28	0.7%	3,898
Biological mother's immigrant status	3,773	125	3.2%	3,898
Biological mother recent immigrant	1,210	104	7.9%	1,314
Biological mother lives with child	3,789	109	2.8%	3,898
Biological mother years in the United States	1,210	104	7.9%	1,314
Household monthly income range	3,403	495	12.7%	3,898
Mother's employment status	3,598	300	7.7%	3,898
Mother has a GED	3,757	141	3.6%	3,898
Biological mother educational attainment	3,757	141	3.6%	3,898
Mother's marital status	3,759	139	3.6%	3,898
Mother's age	3,722	176	4.5%	3,898
Number of moves in last 12 months	3,449	449	11.5%	3,898
Other caregiver's employment status	135	13	8.8%	148
Other caregiver's educational attainment	134	14	9.5%	148
Number of adults 18 and over in household	3,534	364	9.3%	3,898
Primary caregiver health impairs caring for child	3,545	353	9.1%	3,898
Primary caregiver's health	3,545	353	9.1%	3,898
Child had dental care, fall 02	3,542	356	9.1%	3,898
Child's health status, fall 02	3,544	354	9.1%	3,898
Child had care for an injury, fall 02	3,537	361	9.3%	3,898
Child has health insurance, fall 02	3,542	356	9.1%	3,898
Child needs ongoing health care, fall 02	3,785	113	2.9%	3,898
Child has regular place for medical care, fall 02	3,538	360	9.2%	3,898
PELS, fall 02	3,548	350	9.0%	3,898
Child has special needs, fall 02	3,787	111	2.8%	3,898
Child has an unmet health need, fall 02	3,540	358	9.2%	3,898
Housing problems scale	3,514	384	9.9%	3,898
Receives Food Stamps	3,771	127	3.3%	3,898
Receives TANF	3,765	133	3.4%	3,898
Respondent's relationship to child	3,786	112	2.9%	3,898
Public or subsidized housing	3,523	368	9.5%	3,891
Mother had a teen birth	3,733	165	4.2%	3,898
Number of children under age 6 in household	3,796	102	2.6%	3,898
Depression maximum likelihood ability estimate	3,536	362	9.3%	3,898
Depression IRT true-score	3,536	362	9.3%	3,898
Elision IRT score	2,408	294	10.9%	2,702
Elision true score	2,408	294	10.9%	2,702
PPVT IRT score	3,187	465	12.7%	3,652
PPVT true score	3,187	465	12.7%	3,652
PPVT standard score	3,187	465	12.7%	3,652
PPVT W-ability score	3,187	465	12.7%	3,652
Spanish Elision IRT score	1,015	124	10.9%	1,139
Spanish Elision true score	1,015	124	10.9%	1,139
TVIP IRT score	1,038	101	8.9%	1,139
TVIP true score	1,038	101	8.9%	1,139
TVIP standard score	1,038	101	8.9%	1,139
TVIP W-ability score	1,038	101	8.9%	1,139
Locus of control IRT scale score	3,534	364	9.3%	3,898
Locus of control true scale score	3,534	364	9.3%	3,898
Is respondent mother or father?	3,796	102	2.6%	3,898
How well did child do in bear counting	3,434	464	11.9%	3,898
Bear counting score	3,260	638	16.4%	3,898
Book score, total	3,473	425	10.9%	3,898
Color name score, total	3,516	382	9.8%	3,898
CTOPPP Elision total score	2,408	294	10.9%	2,702
CTOPPP Spanish Elision total score	1,015	124	10.9%	1,139
CTOPPP print score	1,034	105	9.2%	1,139
McCarthy total drawing score	3,508	390	10.0%	3,898
KFAST raw score	3,220	678	17.4%	3,898
PPVT: total score	3,187	465	12.7%	3,652
Print knowledge score: total	3,445	453	11.6%	3,898
TVIP: total score	1,038	101	8.9%	1,139
WJ3 Applied problems standard score	2,392	310	11.5%	2,702
S_WJ3APPLIED_W	2,392	310	11.5%	2,702
WJ3 Applied problems W score	2,378	324	12.0%	2,702
WJ3 Oral comprehension standard score	2,378	324	12.0%	2,702
WJ3 Oral comprehension W score	2,426	276	10.2%	2,702
WJ3 Spelling W score	2,426	276	10.2%	2,702
WJ3 Letter-word standard score	3,217	435	11.9%	3,652
WJ3 Letter-word W score	3,217	435	11.9%	3,652
WJ3 Applied problems total score	2,392	310	11.5%	2,702
WJ3 Oral comprehension, total score	2,378	324	12.0%	2,702
WJ3 Spelling, total score	2,426	276	10.2%	2,702
WJ3 Letter-word total score	3,217	435	11.9%	3,652
WM Applied problems total score	1,017	122	10.7%	1,139
WM Applied problems, standard score	1,017	122	10.7%	1,139
WM Applied problems, W score	1,017	122	10.7%	1,139
WM Dictation, total score	1,024	115	10.1%	1,139
WM Dictation, standard score	1,024	115	10.1%	1,139
WM Dictation, W score	1,024	115	10.1%	1,139
WM Letter-word, total score	1,028	111	9.7%	1,139
WM Letter-word, standard score	1,028	111	9.7%	1,139
WM Letter-word, W score	1,028	111	9.7%	1,139
Child age as of 9/1/02	3,898	0	0.0%	3,898
Welfare IRT scale score	3,689	209	5.4%	3,898
Welfare true scale score	3,689	209	5.4%	3,898
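
The Percent Imputed column in the exhibit appears to be the imputed count divided by the total of reported and imputed counts, rounded to one decimal place. A minimal Python sketch of that calculation, with a hypothetical function name:

```python
def percent_imputed(reported, imputed):
    """Imputed count as a percentage of reported plus imputed, to one decimal."""
    total = reported + imputed
    return round(100 * imputed / total, 1)
```

For father's employment status, `percent_imputed(1875, 2023)` gives 51.9, matching the exhibit.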
