OVERVIEW AND EVALUATION OF AGGIES, AN AUTOMATED EDIT AND IMPUTATION SYSTEM

Todd Todaro and Kara Perritt

National Agricultural Statistics Service

Introduction

The National Agricultural Statistics Service (NASS) collects and summarizes information about the nation's agriculture through a variety of surveys and the Census of Agriculture. After data collection and prior to the summarization and publication of statistics, the data are edited for completeness and consistency. Accurate edited data are important for making inferences about the underlying population characteristics (e.g., estimating population totals and ratios). Edited data are also used as control data for designing future surveys and improving the accuracy of the estimates from them.

A wide variety of editing tools have been developed over the past several decades to improve the editing process and data quality (e.g., macro editing, selective editing and statistical editing). The different types of edit tools can be thought to have a complementary effect. For example, macro editing techniques can be used to selectively identify suspicious data having a large impact at an aggregate level, drill down to the micro data and make corrective actions. The remaining data, those having a minor impact on aggregate levels, could then be edited using a micro editing system to ensure data consistency. The goal, then, is to come up with an appropriate mix of edit tools to form a complete edit strategy that will improve data quality and the quality of the edit process (De Jong, 1996).

This paper discusses an automated edit and imputation system, currently under research, called the Agricultural Generalized Imputation and Edit System (AGGIES). Much of the methodology is based on the Generalized Edit and Imputation System (GEIS) developed at Statistics Canada (Cotton, 1993). AGGIES was developed using the SAS programming language and, with its object-oriented features, allows the user to easily run the system and make selections using the mouse to point and click. It is designed to edit nonnegative, continuous values and requires the edits to be of linear form, that is, linear inequalities or linear equalities. The system comprises a number of modules, each performing a separate function.

Currently in NASS, edit tools include Blaise, the Survey Processing System (SPS), the Interactive Data Analysis System (IDAS), and the complex edit (Pense, 1997; Todaro, 1999b). Two reasons suggest considering AGGIES as part of NASS' complete edit strategy. First, AGGIES is a generalized automated edit and imputation system that can be applied to both surveys and censuses. Presently, Blaise, SPS and IDAS require editor intervention to correct data and are only used for editing survey data. The complex edit allows for machine error correction but is written specifically for editing data from the Census of Agriculture. With the adoption of AGGIES as a core edit tool, a single system could be devised for editing both survey and census data, engendering consistency in the editing process. Second, since AGGIES and IDAS are both written in SAS, the two systems could be integrated to form an enlarged core system performing micro and macro editing functions, as is currently being done for a pilot project.

The next section describes the modules comprising AGGIES. It is followed by results from a preliminary application to NASS data and by plans for future evaluation.

AGGIES Description

The edits are entered into the system in the edit specification module. An edit is specified by typing in the edit identifier and coefficients, and selecting the variables and an inequality or equality sign from selection lists. A maximum of twenty variables can contribute to an edit. Error checking features ensure that all coefficients are numeric and all components forming the edit are entered.

An edit may be modified by selecting the 'modify edit' option which displays a list of edit identifiers corresponding to all edits that have been entered into the system. Upon the selection of an edit identifier, the edit specification screen appears with the edit information filled in for the corresponding edit. To delete an edit, the 'delete edit' option is selected followed by the selection of an edit identifier corresponding to the edit.

The edit check module checks for consistency of the entire edit set, redundant edits and hidden equality edits. An edit set is inconsistent if no data record can satisfy all edits simultaneously; otherwise it is consistent. A redundant edit is an edit that is implied by two or more edits in the edit set. A hidden equality edit is an equality edit not contained in the edit set, but rather implied by two or more inequality edits in the edit set. The output of this module displays any edits that are redundant, any edits that imply a hidden equality, and the range of values for every variable involved in at least one edit.

Edit and data groups may be formed by selecting the 'form edit groups' option. An edit group is a subset of edits that is applied to a collection of data records called a data group. An edit group is formed by selecting the edit identifiers corresponding to the edits forming the edit group. A data group is created by forming a SAS subsetting condition that describes the data records belonging to the data group. Any number of edit/data groups may be formed. AGGIES will process all of the groups in a single run.

The edit summary module displays summary output information from applying the edits to the data records. Counts of the number of records satisfying all edits and failing at least one edit are displayed in the first output. The second output displays for each edit, including positivity edits (since the values are required to be non-negative), the number of records satisfying and failing each edit.
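The two summary outputs can be illustrated with a small Python sketch. It is not the AGGIES (SAS) implementation: the edit representation (a dictionary of coefficients, an operator and a right-hand side), the identifiers BAL and POS_*, and the three sample records are all assumptions made for illustration.

```python
import operator

# Comparison operators allowed in a linear edit.
OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

def passes(record, edit):
    """Return True if the record satisfies one linear edit (coeffs, op, rhs)."""
    coeffs, op, rhs = edit
    return OPS[op](sum(c * record[v] for v, c in coeffs.items()), rhs)

def edit_summary(records, edits):
    """Count records passing all edits / failing any edit, plus per-edit counts."""
    per_edit = {name: {"pass": 0, "fail": 0} for name in edits}
    clean = 0
    for record in records:
        record_ok = True
        for name, edit in edits.items():
            if passes(record, edit):
                per_edit[name]["pass"] += 1
            else:
                per_edit[name]["fail"] += 1
                record_ok = False
        clean += record_ok
    return {"all_pass": clean, "any_fail": len(records) - clean, "per_edit": per_edit}

# Hypothetical edit set: one balance edit plus a positivity edit per variable.
variables = ["ewes", "rams", "market", "total"]
edits = {"BAL": ({"ewes": 1, "rams": 1, "market": 1, "total": -1}, "==", 0)}
edits.update({f"POS_{v}": ({v: 1}, ">=", 0) for v in variables})

records = [
    {"ewes": 300, "rams": 10, "market": 50, "total": 360},  # satisfies every edit
    {"ewes": 300, "rams": 10, "market": 50, "total": 400},  # fails the balance edit
    {"ewes": -5, "rams": 0, "market": 5, "total": 0},       # fails positivity on ewes
]

summary = edit_summary(records, edits)
print(summary["all_pass"], summary["any_fail"])
print(summary["per_edit"]["BAL"], summary["per_edit"]["POS_ewes"])
```

Treating the positivity requirements as ordinary edits keeps the per-edit counts uniform, which mirrors the second output described above.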

Outliers can be detected by selecting the outlier detection module. This module identifies univariate outliers utilizing the Hidiroglou-Berthelot method (Cotton, 1993) using current data. Since it has been observed that a large number of outliers may result, only those outlying records that are also involved in a failed edit are displayed.
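As a rough illustration, the sketch below applies quartile-based bounds in the spirit of the Hidiroglou-Berthelot method to the current values of a single variable, and then keeps only the outliers that also belong to a record with a failed edit. The exact transformation and parameter settings used in AGGIES are not given here, so the parameter names (c, a), their default values and the sample data are assumptions.

```python
import statistics

def quartile_outliers(values, c=4.0, a=0.05):
    """Flag indices whose value falls outside median -/+ c * quartile distance.

    The quartile distances are floored at |a * median| so that tightly
    clustered data do not produce unreasonably narrow bounds.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    d_low = max(med - q1, abs(a * med))
    d_high = max(q3 - med, abs(a * med))
    lower, upper = med - c * d_low, med + c * d_high
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical 'total sheep and lambs' values for seven records.
totals = [120, 150, 160, 180, 200, 210, 5000]
outliers = quartile_outliers(totals)

# Only outliers that are also involved in a failed edit are displayed.
records_with_failed_edit = {2, 6}   # hypothetical record indices
displayed = [i for i in outliers if i in records_with_failed_edit]
print(outliers, displayed)
```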

For those data records failing one or more edits, the error localization module identifies the fewest values to change per record so that after imputation, the record can satisfy all of the edits. An option allows for the specification of variable reliability weights, with the default weights equal to one. If weights other than one are specified, then the fewest weighted values are changed rather than the fewest values. Thus, all things being equal, the higher the weight for a variable, the less likely the variable value will be changed. The methodology underlying this module is based on Chernikova's algorithm (Schiopu-Kratina and Kovar, 1989). The output of this module consists of two parts. The number of times each value was identified to be changed is displayed in the first part. The second part displays for each record having at least one value identified to be changed, the originally reported record followed by the error-localized record. The distinguishing feature of the error-localized record is the placement of the value '-1' for those values identified to be changed.
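The weighted minimum-change idea can be sketched for very small problems as follows. This is explicitly not Chernikova's algorithm: it enumerates candidate sets of variables in order of total weight and tests each one with Fourier-Motzkin elimination, which is only practical for a handful of variables. The edit format, variable names, record and second set of weights are assumptions; reported values that stay fixed are assumed to be nonnegative.

```python
from fractions import Fraction
from itertools import combinations

def feasible(rows, n):
    """Fourier-Motzkin test: does some x >= 0 satisfy sum(a*x) <= b for all rows?"""
    rows = [([Fraction(v) for v in a], Fraction(b)) for a, b in rows]
    for k in range(n):  # nonnegativity written as -x_k <= 0
        rows.append(([Fraction(-1 if j == k else 0) for j in range(n)], Fraction(0)))
    for k in range(n):  # eliminate x_k by pairing upper and lower bounds
        pos = [r for r in rows if r[0][k] > 0]
        neg = [r for r in rows if r[0][k] < 0]
        rows = [r for r in rows if r[0][k] == 0]
        for pa, pb in pos:
            for na, nb in neg:
                fp, fn = -na[k], pa[k]  # positive multipliers that cancel x_k
                rows.append(([fp * p + fn * q for p, q in zip(pa, na)], fp * pb + fn * nb))
    return all(b >= 0 for _, b in rows)

def localize(record, edits, weights):
    """Return the least-weight set of variables whose change can satisfy all edits."""
    names = list(record)
    rows = []
    for coeffs, op, rhs in edits:  # rewrite every edit as a.x <= b
        a = [coeffs.get(v, 0) for v in names]
        if op in ("<=", "=="):
            rows.append((a, rhs))
        if op in (">=", "=="):
            rows.append(([-c for c in a], -rhs))
    idx = range(len(names))
    subsets = [s for r in range(len(names) + 1) for s in combinations(idx, r)]
    subsets.sort(key=lambda s: sum(weights[names[j]] for j in s))
    for free in subsets:
        reduced = []
        for a, b in rows:  # substitute the values that are kept as reported
            fixed = sum(a[j] * record[names[j]] for j in idx if j not in free)
            reduced.append(([a[j] for j in free], b - fixed))
        if feasible(reduced, len(free)):
            return [names[j] for j in free]

edits = [
    ({"ewes": 1, "rams": 1, "market": 1, "total": -1}, "==", 0),  # balance edit
    ({"rams": 10, "ewes": -1}, "<=", 0),                          # rams <= 10% of ewes
]
record = {"ewes": 300, "rams": 10, "market": 50, "total": 400}
weights = {"ewes": 4, "rams": 5, "market": 3, "total": 1}

to_change = localize(record, edits, weights)
localized = {v: (-1 if v in to_change else x) for v, x in record.items()}
print(to_change, localized)
```

With these weights the balance failure is resolved by flagging the low-weight total; with all weights equal to one, the first single variable that works in listing order (ewes) is returned instead, which illustrates how ties between equally good solutions can arise.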

Prior to the imputation of values, several input options are available. The first allows for the selection of the order in which the variables are imputed in the data records. Second, previously imputed values may either be included or excluded when imputing current values. Third, for each variable, up to six imputation estimators and their order of application may be selected. If more than one imputation estimator is selected for a particular variable, imputation is attempted using the estimators in the selected order. The value of the first imputation estimator that will result in the data record satisfying all edits is imputed. If no imputation estimator is selected for a particular variable, or none of the selected estimators will result in the record satisfying all edits, the set of values for that variable that will result in the record satisfying all edits is calculated and the midpoint of this set is imputed. This default midpoint imputation method, borrowed from the Structured Programs for Economic Editing and Referrals (SPEER) system, guarantees that each data record will satisfy all edits after imputation (Todaro, 1997).
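A minimal Python sketch of this estimator cascade with the midpoint fallback is shown below for a single flagged variable. The edit format, the variable names and the two estimators are hypothetical; the sketch also assumes that the feasible range is bounded and that the edits not involving the flagged variable already pass.

```python
import math

def feasible_interval(record, var, edits):
    """Range of values for `var` that satisfies every edit, other values held fixed."""
    lo, hi = 0.0, math.inf  # values are required to be nonnegative
    for coeffs, op, rhs in edits:
        a = coeffs.get(var, 0)
        if a == 0:
            continue  # edit does not involve the flagged variable
        rest = rhs - sum(c * record[v] for v, c in coeffs.items() if v != var)
        bound = rest / a
        is_upper = (op == "<=" and a > 0) or (op == ">=" and a < 0)
        is_lower = (op == "<=" and a < 0) or (op == ">=" and a > 0)
        if op == "==" or is_upper:
            hi = min(hi, bound)
        if op == "==" or is_lower:
            lo = max(lo, bound)
    return lo, hi

def impute(record, var, edits, estimators):
    """Use the first estimator whose value passes all edits, else the midpoint."""
    lo, hi = feasible_interval(record, var, edits)
    for estimator in estimators:
        value = estimator(record)
        if lo <= value <= hi:
            return value
    return (lo + hi) / 2

# Hypothetical edits: wool per animal shorn must lie between 5 and 15 pounds.
edits = [
    ({"wool": 1, "shorn": -15}, "<=", 0),
    ({"wool": 1, "shorn": -5}, ">=", 0),
]
record = {"shorn": 100, "wool": -1}  # -1 marks the value flagged for change

previous_value = lambda r: 2000        # hypothetical historical value, fails the edits
ratio_estimate = lambda r: 9.0 * r["shorn"]  # hypothetical ratio estimator

print(feasible_interval(record, "wool", edits))
print(impute(record, "wool", edits, [previous_value]))
print(impute(record, "wool", edits, [previous_value, ratio_estimate]))
```

The first call to impute falls back to the midpoint of the feasible range because the only estimator offered violates an edit; the second accepts the ratio estimate because it lies inside the range.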

The output of imputation consists of two parts. Imputation counts by imputation estimator by variable are displayed in the first part. The second part displays, for each data record having at least one value imputed, the originally reported record followed by the imputed record.

An interactive screen has been added to AGGIES that allows the values to be interactively edited.

This screen displays two forms. The form on the right-hand side displays the current data and can be interactively modified. The form on the left-hand side displays information that may be useful for editing data such as originally reported data or historical data. Different data records can be displayed by selecting the identification information from a selection list created (e.g., a list may have been created that contains those outlying records identified in the outlier detection module) or by typing in the identification information for a particular record. When values are modified in the right form, the edits are instantaneously applied. If any edits fail, those values in the form that are involved in a failed edit are highlighted. The failed edits may also be viewed. Although this interactive screen has been a valuable addition, to improve the display of the values it has to be customized for each application.

Application

The first data used to evaluate AGGIES were from the September 1996 Iowa Quarterly Hog Report survey. For this evaluation, aggregate statistics from AGGIES were compared with those from the current Blaise/SPS/IDAS editing system which was treated as "truth". The results, published by Todaro (1999a), were encouraging; however, since it was a one state, one survey study, a more complete evaluation of the system was needed.

The next evaluation used the January 1999 Sheep Report survey data for California, Colorado, Texas and Wyoming. The basic evaluation procedures are described first, followed by the results.

Prior to the survey period, historic Sheep Report data were used to fine tune the edits, reliability weights, and imputation estimators. Next, the Blaise interactive edit (IE) was modified to only flag administrative coding errors and calculate weighting adjustments. As the survey commenced, the four states involved were instructed to do minimal manual editing on paper questionnaires but to otherwise process the data as usual, i.e., use the Blaise/SPS/IDAS editing system. After the survey was completed, the post-Blaise but pre-SPS/IDAS data were run through AGGIES. In other words, the only edits done on these data were administrative checks. From the AGGIES' output, expanded aggregate statistics were compared to those calculated from the survey edited data that had, during the live production, gone through the complete Blaise/SPS/IDAS editing process. Data for the state with the largest percentage of records with errors, Wyoming, were run through AGGIES three times to assess repeatability of the results. Finally, to complete the evaluation, editing and imputation accuracy indices were calculated for Wyoming's data.

Table 1 displays an overview of the evaluation results for the four states. Included are the total number of records processed, the total time required for error localization, the number of records failing at least one AGGIES' edit, and the number of variables, out of 20 common variables, with an absolute percent difference greater than five. This percent difference compared the expanded AGGIES imputed data to the expanded survey edited data.

Table 1. Overview of Evaluation Results

State	Total Records	Error Localization Time^1/ (min)	Records with Errors	Number of Variables^2/ with Absolute Percent Difference^3/ Greater than 5%
CA	555	2	69 (12%)	5
CO	595	3	102 (17%)	1
WY	906	12^4/	184 (20%)	2^4/
TX	2,400	4	99 (4%)	3

1/ 400-MHz Pentium computer

2/ Out of 20 variables common to all states

3/ Between expanded AGGIES imputed data and expanded survey edited data

4/ Average of three runs through AGGIES

Since Wyoming had the largest percentage of records with errors, the evaluation results discussed from here on are limited to that State. Prior to discussing specific results, however, a few comments regarding procedures should be mentioned. When the error localization module was run, a 10 second time limit was imposed. In other words, the computer was allotted 10 seconds to error localize each individual record. If the time limit was exceeded for a particular record, AGGIES stopped processing that record and went on to the next one. The data for any record that exceeded this 10 second limit were replaced with the survey edited data. Likewise, data from any record identified as an outlier with respect to the 'total sheep and lambs' variable were replaced with the survey edited data. Wyoming had no outlying records identified. However, AGGIES' processing on the first run was slowed by heavy local area network (LAN) traffic, resulting in 15 records that exceeded the time limit. No records exceeded the time limit in the second and third runs.

Table 2 indicates the variability at the expanded level between the three AGGIES' runs for the Wyoming data. The table displays, for a subset of the survey variables, the AGGIES expanded total for each run and the standard deviation between the runs. The table is sorted by the standard deviation in descending order.

Table 2. Variability in the Expanded Total for the Three Wyoming Runs

Variable	AGGIES Expanded Total, Run 1	AGGIES Expanded Total, Run 2	AGGIES Expanded Total, Run 3	Standard Deviation
Wool from Breeding Animals	4,198,744	4,201,843	4,201,843	1,789
Wool from Market Animals	493,216	494,612	494,612	806
Breeding Animals Shorn	446,837	448,106	448,106	733
Market Animals Shorn	121,925	122,273	122,273	201
Total Sheep and Lambs	574,431	574,415	574,244	104
Market Lambs 85 to 105 lbs.	73,393	73,381	73,381	7
Rams for Breeding	12,215	12,214	12,214	1
Market Lambs Over 105 lbs.	29,419	29,419	29,419	0
Market Sheep	2,618	2,618	2,618	0
Ewes for Breeding	363,222	363,222	363,222	0

Notice that, for the first four variables listed, run 2 and run 3 have identical expanded totals so the only contribution to variability for these variables is from run 1. Records that exceeded the error localization time limit in the first run caused this variability. Increasing this imposed time limit or running the program on a faster computer would likely rectify this situation for these variables.
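The standard deviations in Table 2 can be reproduced from the three run totals. The short Python check below uses the sample standard deviation (divisor n - 1), which agrees with the tabulated values for the rows shown; that this is the divisor the authors used is an inference from the numbers rather than something stated in the text.

```python
import statistics

# Expanded totals for runs 1-3, copied from Table 2.
runs = {
    "Wool from Breeding Animals": (4198744, 4201843, 4201843),
    "Wool from Market Animals": (493216, 494612, 494612),
    "Total Sheep and Lambs": (574431, 574415, 574244),
}

# Sample standard deviation between the three runs, rounded to whole units.
between_run_sd = {name: round(statistics.stdev(totals)) for name, totals in runs.items()}
for name, sd in between_run_sd.items():
    print(f"{name}: {sd:,}")
```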

Table 3 shows, for a subset of the survey variables, the reliability weights assigned to each, the average expanded totals for Wyoming calculated from running the data through AGGIES three times, the expanded survey edited totals, and the percent difference between the two. The table is sorted by absolute percent difference in descending order.

Table 3. Comparison of Average Expanded Totals for Wyoming Data

Variable	Weight	AGGIES Average	Survey Edited	Percent Difference
Market Sheep	3	2,618	1,739	50.55
Breeding Animals Shorn	2	447,683	443,548	0.93
Total Sheep and Lambs	1	574,363	570,890	0.61
Ewes for Breeding	4	363,222	361,095	0.59
Wool from Breeding Animals	1	4,200,810	4,225,283	-0.58
Rams for Breeding	5	12,214	12,153	0.50
Market Animals Shorn	2	122,157	121,809	0.29
Wool from Market Animals	1	494,146	493,623	0.11
Market Lambs Over 105 lbs.	2	29,419	29,411	0.03
Market Lambs 85 to 105 lbs.	1	73,385	73,381	0.01

The large percentage difference for the 'market sheep' variable can be directly attributed to one record that failed a balance edit. During the survey process, the 'market sheep' value for this record was zeroed out in order to pass the balance edit. AGGIES, however, will always update another variable due to the assignments of the reliability weights. These assignments work well for the majority of the records, so no change in weights is advocated. The remedy for this record perhaps lies in using another tool, such as IDAS, for a macro review of the data with drill-down capability.
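The percent differences in Table 3 are consistent with taking the survey edited total as the base, that is, 100 x (AGGIES average - survey edited) / survey edited. The Python check below reproduces three rows of the table and applies the five percent screen used in Table 1; the function name is introduced here for illustration only.

```python
def percent_difference(aggies_total, survey_total):
    """Percent difference of the AGGIES total relative to the survey edited total."""
    return 100 * (aggies_total - survey_total) / survey_total

# (AGGIES average, survey edited) pairs copied from Table 3.
table3 = {
    "Market Sheep": (2618, 1739),
    "Breeding Animals Shorn": (447683, 443548),
    "Wool from Breeding Animals": (4200810, 4225283),
}

differences = {name: round(percent_difference(*pair), 2) for name, pair in table3.items()}
over_five = [name for name, diff in differences.items() if abs(diff) > 5]
print(differences)
print(over_five)
```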

To complete this evaluation of AGGIES, the accuracy indices of Manzari and Della Rocca (1999) were calculated from the Wyoming output. These indices evaluate the quality of the AGGIES editing and imputation procedures based on the number of detected, undetected and introduced errors. The indices are divided into three groups of three: the first three indices assess the quality of editing, the next three assess the quality of imputation and the final three assess the overall quality of both editing and imputation. Table 4 describes each index and supplies its formula.

Table 4. Accuracy Indices (Manzari and Della Rocca, 1999)

Assessing ...	Index	Calculation
Editing Quality	I1: fraction of unmodified data correctly handled	d/(c+d)
Editing Quality	I2: fraction of modified data correctly handled	a/(a+b)
Editing Quality	I3: fraction of total data correctly handled	(a+d)/(a+b+c+d)
Imputation Quality	I4: fraction of changed, unmodified data whose original value is correctly restored	c_s/c
Imputation Quality	I5: fraction of changed, modified data whose original value is correctly restored	a_s/a
Imputation Quality	I6: fraction of changed total data whose original value is correctly restored	(a_s+c_s)/(a+c)
Overall Editing and Imputation Quality	I7: fraction of unmodified data whose original value is correctly restored	(c_s+d)/(c+d)
Overall Editing and Imputation Quality	I8: fraction of modified data whose original value is correctly restored	a_s/(a+b)
Overall Editing and Imputation Quality	I9: fraction of total data whose original value is correctly restored	(a_s+c_s+d)/(a+b+c+d)

Where:

modified = survey edited data that does not equal the reported data

unmodified = survey edited data that equals the reported data

changed = AGGIES edited data that does not equal the reported data

not changed = AGGIES edited data that equals the reported data

a = number of modified data identified to be changed in AGGIES

a_s = number of modified data identified to be changed by AGGIES and imputation was successful

a_f = number of modified data identified to be changed by AGGIES and imputation failed

b = number of modified data identified not to be changed by AGGIES

c = number of unmodified data identified to be changed by AGGIES

c_s = number of unmodified data identified to be changed by AGGIES and imputation was successful

c_f = number of unmodified data identified to be changed by AGGIES and imputation failed

d = number of unmodified data identified not to be changed by AGGIES

		AGGIES Edited Data
		Changed	Not Changed
Survey Edited Data	Modified	a = a_s+a_f	b
Survey Edited Data	Unmodified	c = c_s+c_f	d

Whether an AGGIES imputation was classified as a success or a failure was based on comparing the imputed value to the survey edited value. If the imputed value was within five percent of the survey edited value (bounds exclusive), the imputation was categorized as successful; otherwise, it was a failure.
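The formulas in Table 4 and the five percent success rule can be sketched in Python as follows. The counts in the example are invented for illustration and are not the Wyoming results. Returning None when a denominator is zero is an assumption that corresponds to the '-' entries in Table 5, and the treatment of a survey edited value of zero is likewise an assumption.

```python
def is_success(imputed, survey_edited, tolerance=0.05):
    """Imputation succeeds if it is strictly within 5% of the survey edited value."""
    if survey_edited == 0:
        return imputed == 0  # assumption: a zero target must be matched exactly
    return abs(imputed - survey_edited) < tolerance * abs(survey_edited)

def accuracy_indices(a_s, a_f, b, c_s, c_f, d):
    """Compute indices I1-I9 (in percent) from the counts defined with Table 4."""
    a, c = a_s + a_f, c_s + c_f
    total = a + b + c + d

    def pct(numerator, denominator):
        return None if denominator == 0 else 100 * numerator / denominator

    return {
        "I1": pct(d, c + d),
        "I2": pct(a, a + b),
        "I3": pct(a + d, total),
        "I4": pct(c_s, c),
        "I5": pct(a_s, a),
        "I6": pct(a_s + c_s, a + c),
        "I7": pct(c_s + d, c + d),
        "I8": pct(a_s, a + b),
        "I9": pct(a_s + c_s + d, total),
    }

# Hypothetical counts for one variable (not taken from the evaluation).
indices = accuracy_indices(a_s=8, a_f=2, b=10, c_s=0, c_f=0, d=880)
for name, value in indices.items():
    print(name, "-" if value is None else round(value, 2))
```

In this made-up case no unmodified value was changed (c = 0), so I4 is undefined while I1 and I7 are both 100 percent, the same pattern seen for several variables in Table 5.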

In Table 5, for a subset of the Wyoming variables, the accuracy indices are given averaged over the three AGGIES runs. Indices range from 0% (no accuracy) to 100% (maximum accuracy). The table is sorted in ascending order by index I9, which measures the overall accuracy for the whole data set.

Table 5. Average Value of the Accuracy Indices, Given In Percent

Variable	I1	I2	I3	I4	I5	I6	I7	I8	I9
Wool from Breeding Animals	100	94	99	0	16	16	100	15	86
Breeding Animals Shorn	100	91	99	-	53	53	100	48	97
Wool from Market Animals	99	77	99	0	13	9	99	10	98
Ewes for Breeding	100	50	99	-	100	100	100	50	99
Market Animals Shorn	100	100	100	-	47	47	100	47	99
Rams for Breeding	100	50	99	-	87	87	100	43	99
Total Sheep and Lambs	100	75	99	41	91	78	100	59	99
Market Lambs Over 105 lbs.	100	67	100	-	100	100	100	67	100
Market Lambs 85 to 105 lbs.	100	100	100	0	100	89	100	100	100
Market Sheep	100	40	100	-	100	100	100	40	100

The high values for index I1 indicate that the error localization algorithm performs well and that AGGIES introduces few new errors in the data. The I2 index measures the ability to detect errors in the data. Note that all four of the I2 values less than 75 are linked to variables with high reliability weights. Simply changing reliability weights, however, will not necessarily increase the I2 index. In order to significantly increase AGGIES' power to detect errors in these variables, new inconsistencies, i.e., new edits, would need to be defined. High index I3 values indicate that the overall AGGIES editing accuracy for all variables is strong.

The lack of unmodified data imputed by AGGIES, c, somewhat diminishes the ability of indices I4 and I6 to assess imputation quality. Index I5, which measures the effectiveness of AGGIES to impute values within five percent of the survey edited values, shows that AGGIES does well for six of the ten variables. AGGIES' imputation estimators as well as current survey imputation procedures for the variables 'wool from market animals', 'wool from breeding animals', 'market animals shorn', and 'breeding animals shorn' need to be reviewed to identify the cause of the inconsistencies in the imputed values.

Indices I7 and I8 indicate the overall editing and imputation quality for unmodified and modified data respectively. AGGIES appears to handle unmodified data more effectively; however, given the overall editing and imputation quality index, I9, the system, on the whole, was proficient in treating the data for all variables.

Another evaluation of AGGIES currently underway uses the July 1999 Sheep Report data for the same four states, California, Colorado, Texas and Wyoming. This project took place parallel to the live production using the same general procedures used for the January data: minimal hand editing, minimized Blaise interactive edit checks, and batch runs through AGGIES. Deviating from the past evaluation, outlying reports and records that exceeded the imposed error localization time limit were interactively reviewed in the interactive AGGIES module. Also, in contrast to the previous evaluation, the AGGIES output file was reviewed at the macro level in IDAS with updates made in the AGGIES interactive module. It will be this post-AGGIES/IDAS data that will be compared to the survey edited data. This evaluation should more aptly mimic the entire proposed survey processing procedures.

The last ongoing AGGIES project uses the 1997 Census of Agriculture hog data for Iowa and North Carolina. Current survey edit specifications were used to develop the AGGIES edits, reliability weights, and imputation estimators. Approximately 17,000 Iowa records and 3,000 North Carolina records with positive, original keyed hog data were obtained. A preliminary run, using the 17,000 Iowa reports, was completed in order to get an idea of the time requirements of the system. The error localization module took 3.5 hours to run on a 400-MHz Pentium. Of the records processed, 55% had errors, and 9% of those exceeded a five second time limit. Further research using these data will be conducted and a comparison to the final Census hog numbers made.

Conclusions

Using AGGIES has several potential advantages for NASS:

Commodity data editing and imputation performed by AGGIES results in a data set similar to the one produced by NASS, as demonstrated using the data from the 1996 Iowa Quarterly Hog Report and the January 1999 Sheep Report for California, Colorado, Texas and Wyoming. This minimizes the need for manually reviewing and correcting the data, thereby allowing for a more efficient review of the data with the potential for cost and time savings. In addition, there should be more time for review by statisticians at the macro level.

The editing and imputation functions of AGGIES are performed objectively which allows for consistency throughout the data cleaning process with the results being nearly repeatable. Only when there are multiple solutions identified in the error localization module can the results of separate repetitions through the system differ.

The system can be easily applied to any number of surveys, thus concentrating resources on the development and maintenance of a single system. Survey edits and imputation parameters are the main inputs that need to be maintained on a survey-to-survey basis.

However, there are several issues to address when using AGGIES for NASS surveys and the Census of Agriculture:

AGGIES will not perform all editing functions. It is designed for continuous, non-negative data. Editing of completion codes and data adjustment factors must be performed outside of the system.

A plan for implementing AGGIES within NASS's survey editing and imputation process, so that it forms part of a complete and integrated edit and imputation strategy, needs to be finalized. The integration of AGGIES with NASS's IDAS is already underway. Results from the July 1999 Sheep Report pilot study will provide insight into completing this integration and should lay the groundwork for AGGIES' implementation plans and additional integrations.

References

Cotton, C. (1993), "Functional Description of the Generalized Edit and Imputation System," Statistics Canada Technical Report.

De Jong, W. A. M. (1996), "Designing a Complete Edit Strategy; Combining Techniques," Working Paper No. 29, Conference of European Statisticians, Work Session on Statistical Data Editing, United Nations Statistical Commission and Economic Commission for Europe.

Manzari, A., and Della Rocca, G. (1999), "A Generalised System Based On a Simulation Approach to Test the Quality of Editing and Imputation Procedures," Working Paper No. 13, Conference of European Statisticians, Work Session on Statistical Data Editing, United Nations Statistical Commission and Economic Commission for Europe.

Pense, R. (1997), "Editing Strategies at the National Agricultural Statistics Service," Working Paper No. 31, Conference of European Statisticians, Work Session on Statistical Data Editing, United Nations Statistical Commission and Economic Commission for Europe.

Schiopu-Kratina, I., and Kovar, J. G. (1989), "Use of Chernikova's Algorithm in the Generalized Edit and Imputation System," Statistics Canada, Methodology Branch Working Paper No. BSMD-89-001E.

Todaro, T. A. (1997), "Evaluation of the SPEER Automatic Edit and Imputation System," National Agricultural Statistics Service, USDA, Washington D.C., RD Research Report No. RD-97-04.

Todaro, T. A. (1999a), "Evaluation of the AGGIES Automated Edit and Imputation System," National Agricultural Statistics Service, USDA, Washington D.C., RD Research Report No. RD-99-01.

Todaro, T. A. (1999b), "Overview and Evaluation of the AGGIES Automated Edit and Imputation System," Working Paper No. 19, Conference of European Statisticians, Work Session on Statistical Data Editing, United Nations Statistical Commission and Economic Commission for Europe.
