OVERVIEW AND EVALUATION OF AGGIES, AN AUTOMATED EDIT AND IMPUTATION SYSTEM
Introduction
The National Agricultural Statistics Service (NASS) collects and summarizes information about the nation's agriculture through the use of a variety of surveys and the Census of Agriculture. After data collection and prior to the summarization and publication of statistics, the data are edited for completeness and consistency. Obtaining accurate edited data is important for making inferences about the underlying population characteristics (e.g., estimating population totals and ratios). Edited data are also used as control data for designing future surveys and improving the accuracy of the estimates from them.
A wide variety of editing tools have been developed over the past several decades to improve the editing process and data quality (e.g., macro editing, selective editing and statistical editing). The different types of edit tools can be thought to have a complementary effect. For example, macro editing techniques can be used to selectively identify suspicious data having a large impact at an aggregate level, drill down to the micro data and make corrective actions. The remaining data, those having a minor impact on aggregate levels, could then be edited using a micro editing system to ensure data consistency. The goal, then, is to come up with an appropriate mix of edit tools to form a complete edit strategy that will improve data quality and the quality of the edit process (De Jong, 1996).
This paper discusses an automated edit and imputation system that is currently under research called the Agricultural Generalized Imputation and Edit System (AGGIES). Much of the methodology is based on the Generalized Edit and Imputation System (GEIS) developed at Statistics Canada (Cotton, 1993). AGGIES was developed using the SAS programming language and, with its object-oriented features, allows the user to easily run the system and make selections using the mouse to point and click. It is designed to edit nonnegative, continuous values and requires the edits to be of linear form, that is, linear inequalities or linear equalities. The system comprises a number of modules, each performing a separate function.
Currently in NASS, edit tools include Blaise, the "Survey Processing" System (SPS), the Interactive Data Analysis System (IDAS), and the complex edit (Pense, 1997; Todaro, 1999b). The following reasons suggest the consideration of using AGGIES as part of NASS' complete edit strategy. Firstly, AGGIES is a generalized automated edit and imputation system that can be applied to both surveys and censuses. Presently, Blaise, SPS and IDAS require editor intervention to correct data and are only used for editing survey data. The complex edit allows for machine error correction but is written specifically for use when editing data from the Census of Agriculture. With the adoption of AGGIES as a core edit tool, a system could be devised for editing both survey and census data, engendering consistency in the editing process. Secondly, since AGGIES and IDAS are both written in SAS, the two systems could be integrated to form an enlarged core system performing micro and macro editing functions as is currently being done for a pilot project.
The next section describes the modules comprising AGGIES. It is followed by a section presenting results from a preliminary application to NASS data and plans for future evaluation.
AGGIES Description
The edits are entered into the system in the edit specification module. An edit is specified by typing in the edit identifier and coefficients, and selecting the variables and an inequality or equality sign from selection lists. A maximum of twenty variables can contribute to an edit. Error checking features ensure that all coefficients are numeric and all components forming the edit are entered.
An edit may be modified by selecting the 'modify edit' option which displays a list of edit identifiers corresponding to all edits that have been entered into the system. Upon the selection of an edit identifier, the edit specification screen appears with the edit information filled in for the corresponding edit. To delete an edit, the 'delete edit' option is selected followed by the selection of an edit identifier corresponding to the edit.
The edit check module checks for consistency of the entire edit set, redundant edits and hidden equality edits. An edit set is inconsistent if no data record can satisfy all edits simultaneously; otherwise it is consistent. A redundant edit is an edit that is implied by two or more edits in the edit set. A hidden equality edit is an equality edit not contained in the edit set, but rather implied by two or more inequality edits in the edit set. The output of this module displays any edits that are redundant, any edits that imply a hidden equality, and the range of values for every variable involved in at least one edit.
Edit and data groups may be formed by selecting the 'form edit groups' option. An edit group is a subset of edits that is applied to a collection of data records called a data group. An edit group is formed by selecting the edit identifiers corresponding to the edits forming the edit group. A data group is created by forming a SAS subsetting condition that describes the data records belonging to the data group. Any number of edit/data groups may be formed. AGGIES will process all of the groups in a single run.
The edit summary module displays summary output information from applying the edits to the data records. Counts of the number of records satisfying all edits and failing at least one edit are displayed in the first output. The second output displays for each edit, including positivity edits (since the values are required to be non-negative), the number of records satisfying and failing each edit.
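The two summary outputs described above can be illustrated with a minimal Python sketch. It is not the AGGIES implementation (which is written in SAS); the edit representation, the identifiers BAL and RAMS, and the sample records are all hypothetical and chosen only for illustration.

```python
import operator

# Hypothetical edit representation: (identifier, {variable: coefficient}, sign, constant),
# read as sum(coefficient * value) <sign> constant.
COMPARE = {"<=": operator.le, ">=": operator.ge, "=": operator.eq}


def summarize_edits(edits, records):
    """Return (records passing all edits, records failing at least one, per-edit counts)."""
    variables = sorted({v for _, coefs, _, _ in edits for v in coefs})
    # Positivity edits are added because the values are required to be non-negative.
    all_edits = list(edits) + [("POS_" + v, {v: 1}, ">=", 0) for v in variables]
    counts = {edit_id: {"pass": 0, "fail": 0} for edit_id, _, _, _ in all_edits}
    n_clean = n_failing = 0
    for record in records:
        clean = True
        for edit_id, coefs, sign, constant in all_edits:
            lhs = sum(c * record[v] for v, c in coefs.items())
            if COMPARE[sign](lhs, constant):
                counts[edit_id]["pass"] += 1
            else:
                counts[edit_id]["fail"] += 1
                clean = False
        if clean:
            n_clean += 1
        else:
            n_failing += 1
    return n_clean, n_failing, counts


# Illustrative edits: a balance edit and a simple inequality edit.
edits = [
    ("BAL", {"ewes": 1, "rams": 1, "lambs": 1, "total": -1}, "=", 0),
    ("RAMS", {"rams": 1, "ewes": -1}, "<=", 0),
]
records = [
    {"ewes": 100, "rams": 5, "lambs": 50, "total": 155},  # passes every edit
    {"ewes": 100, "rams": 5, "lambs": 50, "total": 200},  # fails the balance edit
    {"ewes": 10, "rams": 20, "lambs": 0, "total": 30},    # fails the RAMS edit
]
n_clean, n_failing, counts = summarize_edits(edits, records)
```

Here the first output corresponds to `n_clean` and `n_failing`, and the second to `counts`, which includes one positivity edit per variable.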
Outliers can be detected by selecting the outlier detection module. This module identifies univariate outliers utilizing the Hidiroglou-Berthelot method (Cotton, 1993) using current data. Since it has been observed that a large number of outliers may result, only those outlying records that are also involved in a failed edit are displayed.
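As a rough illustration of quartile-based outlier detection on current data, the following Python sketch builds an acceptance interval around the median and then keeps only the outlying records that also failed an edit. It is an approximation in the spirit of the Hidiroglou-Berthelot approach, not the AGGIES or GEIS code; the function name and both multipliers (`min_dist_mult`, `fence_mult`) are assumptions.

```python
import statistics


def quartile_outliers(values, min_dist_mult=0.05, fence_mult=4.0):
    """Return the positions of values falling outside a quartile-based interval.

    Both multipliers are hypothetical defaults chosen for this illustration.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    # Guard against intervals that collapse when the quartiles sit close to the median.
    dist_low = max(median - q1, abs(min_dist_mult * median))
    dist_high = max(q3 - median, abs(min_dist_mult * median))
    lower = median - fence_mult * dist_low
    upper = median + fence_mult * dist_high
    return [i for i, v in enumerate(values) if v < lower or v > upper]


total_sheep = [10, 12, 11, 13, 12, 11, 10, 500]  # illustrative values only
outliers = quartile_outliers(total_sheep)

# Only outlying records that are also involved in a failed edit are displayed.
failed_edit_records = {2, 7}  # hypothetical positions of records failing an edit
displayed = [i for i in outliers if i in failed_edit_records]
```

Restricting the display to records that also fail an edit keeps the review list short, which is the motivation given in the text.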
For those data records failing one or more edits, the error localization module identifies the fewest values to change per record so that after imputation, the record can satisfy all of the edits. An option allows for the specification of variable reliability weights, with the default weights equal to one. If weights other than one are specified, then the fewest weighted values are changed rather than the fewest values. Thus, all things being equal, the higher the weight for a variable, the less likely the variable value will be changed. The methodology underlying this module is based on Chernikova's algorithm (Schiopu-Kratina and Kovar, 1989). The output of this module consists of two parts. The number of times each value was identified to be changed is displayed in the first part. The second part displays for each record having at least one value identified to be changed, the originally reported record followed by the error-localized record. The distinguishing feature of the error-localized record is the placement of the value '-1' for those values identified to be changed.
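The error localization problem can be made concrete with a small Python sketch. Note that it does not use Chernikova's algorithm: it enumerates candidate sets of variables by brute force and tests each one with Fourier-Motzkin elimination, which is only practical for a handful of variables. The edit representation, variable names, and weights are hypothetical.

```python
from itertools import combinations

# Hypothetical representation: each edit is ({variable: coefficient}, bound) and
# means sum(coefficient * value) <= bound. Integer coefficients keep the
# arithmetic exact. An equality is entered as two opposite inequalities.


def substitute(edits, record, free):
    """Plug in the values that stay fixed, leaving rows in the free variables only."""
    rows = []
    for coefs, bound in edits:
        fixed = sum(c * record[v] for v, c in coefs.items() if v not in free)
        rows.append(({v: c for v, c in coefs.items() if v in free}, bound - fixed))
    rows.extend(({v: -1}, 0) for v in free)  # free values must stay non-negative
    return rows


def feasible(rows, free):
    """Fourier-Motzkin elimination: True if some free values satisfy every row."""
    for var in free:
        upper = [r for r in rows if r[0].get(var, 0) > 0]
        lower = [r for r in rows if r[0].get(var, 0) < 0]
        rows = [r for r in rows if r[0].get(var, 0) == 0]
        for uc, ub in upper:
            for lc, lb in lower:
                su, sl = -lc[var], uc[var]  # positive multipliers that cancel var
                coefs = {k: su * uc.get(k, 0) + sl * lc.get(k, 0)
                         for k in set(uc) | set(lc) if k != var}
                rows.append((coefs, su * ub + sl * lb))
    return all(bound >= 0 for _, bound in rows)


def localize(edits, record, weights):
    """Return a minimum-weight set of variables whose change lets the record pass."""
    best_cost, best_set = None, None
    names = list(record)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(weights[v] for v in subset)
            if best_cost is not None and cost >= best_cost:
                continue  # cannot improve on the best solution found so far
            if feasible(substitute(edits, record, subset), subset):
                best_cost, best_set = cost, set(subset)
    return best_set


balance = {"ewes": 1, "rams": 1, "lambs": 1, "total": -1}
edits = [
    (balance, 0),                                   # ewes + rams + lambs <= total
    ({v: -c for v, c in balance.items()}, 0),       # ewes + rams + lambs >= total
    ({"rams": 10, "ewes": -1}, 0),                  # 10 * rams <= ewes
]
record = {"ewes": 100, "rams": 5, "lambs": 50, "total": 200}
weights = {"ewes": 2, "rams": 2, "lambs": 2, "total": 1}
to_change = localize(edits, record, weights)
```

With these weights the sketch selects `total`, the least reliable variable. With all weights equal to one, any single variable in the balance edit would be an equally good solution, which illustrates how multiple minimal solutions can arise.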
Prior to the imputation of values, several input options are available. The first allows for the selection of the order for which the variables are imputed in the data records. Second, previously imputed values may either be included or excluded to impute for current values. Third, for each variable, the selection of up to six imputation estimators and their order of application may be made. If more than one imputation estimator is selected for a particular variable, imputation is attempted using the estimators in the selected order. The value of the first imputation estimator that will result in the data record satisfying all edits is imputed. If no imputation estimator is selected for a particular variable or none of the imputation estimators will result in the record satisfying all edits, for each variable, the set of values that will result in the record satisfying all edits is calculated and the midpoint of this set is imputed. This default midpoint imputation method, borrowed from the Structured Programs for Economic Editing and Referrals (SPEER) system, guarantees that each data record will satisfy all edits after imputation (Todaro, 1997).
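The ordered-estimator logic with a midpoint fallback can be sketched in Python for the simple case of a single value to impute. This is an illustrative simplification, not the AGGIES or SPEER code: the edit representation is hypothetical, the estimator names `previous_value` and `group_ratio` are invented for the example, and the midpoint is only meaningful when the edits bound the value from above.

```python
import math

# Hypothetical representation: each edit is ({variable: coefficient}, bound) and
# means sum(coefficient * value) <= bound.


def feasible_interval(edits, record, var):
    """Range of values for var that satisfies every edit, other values held fixed."""
    low, high = 0.0, math.inf  # values are required to be non-negative
    for coefs, bound in edits:
        c = coefs.get(var, 0)
        if c == 0:
            continue  # edits not involving var are assumed to pass already
        rest = sum(k * record[v] for v, k in coefs.items() if v != var)
        limit = (bound - rest) / c
        if c > 0:
            high = min(high, limit)
        else:
            low = max(low, limit)
    return low, high


def impute(edits, record, var, estimators):
    """Try the estimators in order; fall back to the midpoint of the feasible range."""
    low, high = feasible_interval(edits, record, var)
    for name, estimate in estimators:
        value = estimate(record)
        if value is not None and low <= value <= high:
            return value, name
    return (low + high) / 2, "midpoint"  # infinite if no edit bounds var from above


edits = [
    ({"wool": 1, "shorn": -15}, 0),  # wool <= 15 * shorn
    ({"shorn": 5, "wool": -1}, 0),   # wool >= 5 * shorn
]
record = {"shorn": 100, "wool": -1}  # -1 marks the value flagged by error localization
estimators = [
    ("previous_value", lambda r: 2000),         # hypothetical historical value
    ("group_ratio", lambda r: 9 * r["shorn"]),  # hypothetical ratio estimator
]
value, source = impute(edits, record, "wool", estimators)
fallback_value, fallback_source = impute(edits, record, "wool", [])
```

In this example the first estimator is rejected because its value lies outside the feasible range of 500 to 1,500, so the second estimator supplies the imputed value. With no estimators selected, the midpoint of the range is imputed.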
The output of imputation consists of two parts. Imputation counts by imputation estimator by variable are displayed in the first part. The second part displays for each data record having at least one value imputed, the originally reported record followed by the imputed record.
An interactive screen has been added to AGGIES that allows the values to be interactively edited.
This screen displays two forms. The form on the right-hand side displays the current data and can be interactively modified. The form on the left-hand side displays information that may be useful for editing data such as originally reported data or historical data. Different data records can be displayed by selecting the identification information from a selection list created (e.g., a list may have been created that contains those outlying records identified in the outlier detection module) or by typing in the identification information for a particular record. When values are modified in the right form, the edits are instantaneously applied. If any edits fail, those values in the form that are involved in a failed edit are highlighted. The failed edits may also be viewed. Although this interactive screen has been a valuable addition, to improve the display of the values it has to be customized for each application.
Application
The first data used to evaluate AGGIES were from the September 1996 Iowa Quarterly Hog Report survey. For this evaluation, aggregate statistics from AGGIES were compared with those from the current Blaise/SPS/IDAS editing system which was treated as "truth". The results, published by Todaro (1999a), were encouraging; however, since it was a one state, one survey study, a more complete evaluation of the system was needed.
The next evaluation used the January 1999 Sheep Report survey data for California, Colorado, Texas and Wyoming. The following gives the basic evaluation procedures used and is succeeded by the results.
Prior to the survey period, historic Sheep Report data were used to fine tune the edits, reliability weights, and imputation estimators. Next, the Blaise interactive edit (IE) was modified to only flag administrative coding errors and calculate weighting adjustments. As the survey commenced, the four states involved were instructed to do minimal manual editing on paper questionnaires but to otherwise process the data as usual, i.e., use the Blaise/SPS/IDAS editing system. After the survey was completed, the post-Blaise but pre-SPS/IDAS data were run through AGGIES. In other words, the only edits done on these data were administrative checks. From the AGGIES' output, expanded aggregate statistics were compared to those calculated from the survey edited data that had, during the live production, gone through the complete Blaise/SPS/IDAS editing process. Data for the state with the largest percentage of records with errors, Wyoming, were run through AGGIES three times to assess repeatability of the results. Finally, to complete the evaluation, editing and imputation accuracy indices were calculated for Wyoming's data.
Table 1 displays an overview of the evaluation results for the four states. Included are the total number of records processed, the total time required for error localization, the number of records failing at least one AGGIES' edit, and the number of variables, out of 20 common variables, with an absolute percent difference greater than five. This percent difference compared the expanded AGGIES imputed data to the expanded survey edited data.
Table 1. Overview of Evaluation Results
State | Total Records | Error Localization Time 1/ (min) | Records with Errors | Number of Variables 2/ with Absolute Percent Difference 3/ Greater than 5% |
CA | 555 | 2 | 69 (12%) | 5 |
CO | 595 | 3 | 102 (17%) | 1 |
WY | 906 | 12 4/ | 184 (20%) | 2 4/ |
TX | 2,400 | 4 | 99 (4%) | 3 |
Since Wyoming had the largest percentage of records with errors, evaluation results discussed from
here on will be limited to that state. Prior to discussing specific results, however, a few comments
regarding procedures should be mentioned. When the error localization module was run, a 10
second time limit was imposed. In other words, the computer was allotted 10 seconds to error
localize each individual record. If the time limit was exceeded for a particular record, AGGIES
stopped processing that record and went on to the next one. The data for any record that exceeded
this 10 second limit were replaced with the survey edited data. Likewise, data from any record
identified as an outlier with respect to the 'total sheep and lambs' variable were replaced with the
survey edited data. Wyoming had no outlying records identified. However, AGGIES' processing
on the first run was slowed by heavy local area network (LAN) traffic, which caused 15 records to
exceed the time limit. No records exceeded the time limit in the second and third runs.
Table 2 indicates the variability at the expanded level between the three AGGIES' runs for the
Wyoming data. The table displays, for a subset of the survey variables, the AGGIES expanded total
for each run and the standard deviation between the runs. The table is sorted by the standard
deviation in descending order.
Table 2. Variability in the Expanded Total for the Three Wyoming Runs
Notice that, for the first four variables listed, run 2 and run 3 have identical expanded totals so the
only contribution to variability for these variables is from run 1. Records that exceeded the error
localization time limit in the first run caused this variability. Increasing this imposed time limit or
running the program on a faster computer would likely rectify this situation for these variables.
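The standard deviations in Table 2 can be checked with a few lines of Python. The sketch assumes the reported figure is the sample standard deviation (divisor n - 1) rounded to the nearest whole number, an assumption that reproduces the tabulated values for the rows shown.

```python
import statistics

# Expanded totals for runs 1, 2 and 3, taken from Table 2.
runs = {
    "Wool from Breeding Animals": [4198744, 4201843, 4201843],
    "Total Sheep and Lambs": [574431, 574415, 574244],
    "Ewes for Breeding": [363222, 363222, 363222],
}

# Sample standard deviation across the three runs, rounded as in the table.
spread = {name: round(statistics.stdev(totals)) for name, totals in runs.items()}
```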
Table 3 shows, for a subset of the survey variables, the reliability weights assigned to each, the
average expanded totals for Wyoming calculated from running the data through AGGIES three
times, the expanded survey edited totals, and the percent difference between the two. The table is
sorted by absolute percent difference in descending order.
Table 3. Comparison of Average Expanded Totals for Wyoming Data
The large percentage difference for the 'market sheep' variable can be directly attributed to one
record that failed a balance edit. During the survey process, the 'market sheep' value for this record
was zeroed out in order to pass the balance edit. AGGIES, however, will always update another
variable due to the assignments of the reliability weights. These assignments work well for the
majority of the records, so no change in weights is advocated. The remedy for this record perhaps
lies in using another tool, such as IDAS, for a macro review of the data with drill-down capability.
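For reference, the percent differences in Table 3 are consistent with taking the survey edited total as the base. The short Python sketch below states that calculation explicitly and checks it against two rows of the table; the function name is illustrative.

```python
def percent_difference(aggies_total, survey_total):
    """Percent difference of the AGGIES total relative to the survey edited total."""
    return 100 * (aggies_total - survey_total) / survey_total


# Values taken from Table 3.
market_sheep = round(percent_difference(2618, 1739), 2)
wool_breeding = round(percent_difference(4200810, 4225283), 2)

# The Table 1 criterion counts variables whose absolute difference exceeds five percent.
market_sheep_flagged = abs(percent_difference(2618, 1739)) > 5
```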
To complete this evaluation of AGGIES, Manzari and Rocca's (1999) accuracy indices were
calculated from the Wyoming output. These indices evaluate the quality of the AGGIES' editing
and imputation procedures based on the number of detected, undetected and introduced errors. The
indices are divided into three groups of three: the first three indices assess the quality of editing, the
next three assess the quality of imputation and the final three assess the overall quality of both
editing and imputation. Table 4 describes each index and supplies its formula.
Table 4. Accuracy Indices (Manzari and Rocca, 1999)
Where:
modified = survey edited data that does not equal the reported data
unmodified = survey edited data that equals the reported data
changed = AGGIES edited data that does not equal the reported data
not changed = AGGIES edited data that equals the reported data
a = number of modified data identified to be changed by AGGIES
as = number of modified data identified to be changed by AGGIES and imputation was successful
af = number of modified data identified to be changed by AGGIES and imputation failed
b = number of modified data identified not to be changed by AGGIES
c = number of unmodified data identified to be changed by AGGIES
cs = number of unmodified data identified to be changed by AGGIES and imputation was successful
cf = number of unmodified data identified to be changed by AGGIES and imputation failed
d = number of unmodified data identified not to be changed by AGGIES
Whether the AGGIES imputation was classified as a success or a failure was based on comparing
the imputed value to the survey edited value. If the imputed value was strictly within five
percent of the survey edited value, the imputation was categorized as successful; otherwise, it was
categorized as a failure.
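The nine indices of Table 4 and the five percent success rule can be written out as a short Python sketch. The function names and the sample counts are hypothetical; an index whose denominator is zero is returned as `None`, on the assumption that such cases correspond to the '-' entries in Table 5. Treating an exact match as a success when the survey edited value is zero is also an assumption.

```python
def ratio(numerator, denominator):
    """Return the fraction, or None when it is undefined."""
    return None if denominator == 0 else numerator / denominator


def accuracy_indices(a_s, a_f, b, c_s, c_f, d):
    """Compute indices I1 to I9 from the counts defined for Table 4."""
    a, c = a_s + a_f, c_s + c_f
    total = a + b + c + d
    return {
        "I1": ratio(d, c + d),            # unmodified data correctly handled
        "I2": ratio(a, a + b),            # modified data correctly handled
        "I3": ratio(a + d, total),        # total data correctly handled
        "I4": ratio(c_s, c),              # changed, unmodified data restored
        "I5": ratio(a_s, a),              # changed, modified data restored
        "I6": ratio(a_s + c_s, a + c),    # changed total data restored
        "I7": ratio(c_s + d, c + d),      # unmodified data restored
        "I8": ratio(a_s, a + b),          # modified data restored
        "I9": ratio(a_s + c_s + d, total),  # total data restored
    }


def imputation_successful(imputed, survey_edited, tolerance=0.05):
    """Success if the imputed value is strictly within the tolerance of the edited value."""
    if imputed == survey_edited:
        return True  # assumption: exact agreement counts as success, even at zero
    return abs(imputed - survey_edited) < tolerance * abs(survey_edited)


# Hypothetical counts for one variable, for illustration only.
indices = accuracy_indices(a_s=8, a_f=2, b=5, c_s=1, c_f=1, d=83)
```

Multiplying each index by 100 gives the percentages reported in Table 5.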
In Table 5, for a subset of the Wyoming variables, the accuracy indices are given averaged over the
three AGGIES runs. Indices range from 0% (no accuracy) to 100% (maximum accuracy). The table
is sorted in ascending order by index I9, which measures the overall accuracy for the whole data set.
Table 5. Average Value of the Accuracy Indices, Given In Percent
The high values for index I1 indicate that the error localization algorithm performs well and that
AGGIES introduces few new errors in the data. The I2 index measures the ability to detect errors
in the data. Note that all four of the I2 values less than 75 are linked to variables with high reliability
weights. Simply changing reliability weights, however, will not necessarily increase the I2 index.
In order to significantly increase AGGIES' power to detect errors in these variables, new
inconsistencies, i.e., new edits, would need to be defined. High index I3 values indicate that the
overall AGGIES editing accuracy for all variables is strong.
The lack of unmodified data imputed by AGGIES, c, somewhat diminishes the ability of indices I4
and I6 to assess imputation quality. Index I5, which measures the effectiveness of AGGIES to
impute values within five percent of the survey edited values, shows that AGGIES does well for six
of the ten variables. AGGIES' imputation estimators as well as current survey imputation
procedures for the variables 'wool from market animals', 'wool from breeding animals', 'market
animals shorn', and 'breeding animals shorn' need to be reviewed to identify the cause of the
inconsistencies in the imputed values.
Indices I7 and I8 indicate the overall editing and imputation quality for unmodified and modified
data respectively. AGGIES appears to handle unmodified data more effectively; however, given the
overall editing and imputation quality index, I9, the system, on the whole, was proficient in treating
the data for all variables.
Another evaluation of AGGIES currently underway uses the July 1999 Sheep Report data for the
same four states, California, Colorado, Texas and Wyoming. This project took place parallel to the
live production using the same general procedures used for the January data: minimal hand editing,
minimized Blaise interactive edit checks, and batch runs through AGGIES. Deviating from the past
evaluation, outlying reports and records that exceeded the imposed error localization time limit were
interactively reviewed in the interactive AGGIES module. Also, in contrast to the previous
evaluation, the AGGIES output file was reviewed at the macro level in IDAS with updates made in
the AGGIES interactive module. It will be this post-AGGIES/IDAS data that will be compared to
the survey edited data. This evaluation should more aptly mimic the entire proposed survey
processing procedures.
The last ongoing AGGIES project uses the 1997 Census of Agriculture hog data for Iowa and North
Carolina. Current survey edit specifications were used to develop the AGGIES edits, reliability
weights, and imputation estimators. Approximately 17,000 Iowa records and 3,000 North Carolina
records with positive, original keyed hog data were obtained. A preliminary run, using the 17,000
Iowa reports, was completed in order to get an idea of the time requirements of the system. The error
localization module took 3.5 hours to run on a 400-MHz Pentium. Of the records processed, 55% had
errors, and 9% of those exceeded a five-second time limit. Further research using these data will be
conducted and a comparison to the final Census hog numbers made.
Conclusions
Using AGGIES has several potential advantages for NASS:
Commodity data editing and imputation performed by AGGIES results in a data set similar to the
one produced by NASS, as demonstrated using the data from the 1996 Iowa Quarterly Hog Report
and the January 1999 Sheep Report for California, Colorado, Texas and Wyoming. This minimizes
the need for manually reviewing and correcting the data, thereby allowing for a more efficient
review of the data with the potential for cost and time savings. In addition, there should be more
time for review by statisticians at the macro level.
The editing and imputation functions of AGGIES are performed objectively which allows for
consistency throughout the data cleaning process with the results being nearly repeatable. Only when
there are multiple solutions identified in the error localization module can the results of separate
repetitions through the system differ.
The system can be easily applied to any number of surveys, thus limiting resources to the
development and maintenance of a single system. Survey edits and imputation parameters are the
main inputs that need to be maintained on a survey-to-survey basis.
However, there are several issues to address when using AGGIES for NASS surveys and the
Agricultural Census:
AGGIES will not perform all editing functions. It is designed for continuous, non-negative data.
Editing of completion codes and data adjustment factors must be performed outside of the system.
A plan as to how AGGIES could be implemented in NASS's survey editing and imputation process
to form a complete edit and imputation strategy and system integration needs to be finalized. The
integration of AGGIES with NASS's IDAS is already underway. Results from the July 1999 Sheep
Report pilot study will provide insight into completing this integration and should lay the
groundwork for AGGIES' implementation plans and additional integrations.
References
Cotton, C. (1993), "Functional Description of the Generalized Edit and Imputation System,"
Statistics Canada Technical Report.
De Jong, W. A. M., (1996), "Designing a Complete Edit Strategy; Combining Techniques," Working
Paper No. 29, Conference of European Statisticians, Work Session on Statistical Data Editing,
United Nations Statistical Commission and Economic Commission for Europe.
Manzari, A., and Della Rocca, G. (1999), "A Generalised System Based On a Simulation Approach
to Test the Quality of Editing and Imputation Procedures," Working Paper No. 13, Conference of
European Statisticians, Work Session on Statistical Data Editing, United Nations Statistical
Commission and Economic Commission for Europe.
Pense, R. (1997), "Editing Strategies at the National Agricultural Statistics Service," Working Paper
No. 31, Conference of European Statisticians, Work Session on Statistical Data Editing, United
Nations Statistical Commission and Economic Commission for Europe.
Schiopu-Kratina, I., and Kovar, J. G. (1989), "Use of Chernikova's Algorithm in the Generalized
Edit and Imputation System," Statistics Canada, Methodology Branch Working Paper No. BSMD-89-001E.
Todaro, T. A., (1997), "Evaluation of the SPEER Automatic Edit and Imputation System," National
Agricultural Statistics Service, USDA, Washington D.C., RD Research Report No. RD-97-04.
Todaro, T. A., (1999a), "Evaluation of the AGGIES Automated Edit and Imputation System,"
National Agricultural Statistics Service, USDA, Washington D.C., RD Research Report No. RD-99-01.
Todaro, T. A., (1999b), "Overview and Evaluation of the AGGIES Automated Edit and Imputation
System," Working Paper No. 19, Conference of European Statisticians, Work Session on Statistical
Data Editing, United Nations Statistical Commission and Economic Commission for Europe.
Table 2. Variability in the Expanded Total for the Three Wyoming Runs
Variable | AGGIES Expanded Total, Run 1 | AGGIES Expanded Total, Run 2 | AGGIES Expanded Total, Run 3 | Standard Deviation |
Wool from Breeding Animals | 4,198,744 | 4,201,843 | 4,201,843 | 1,789 |
Wool from Market Animals | 493,216 | 494,612 | 494,612 | 806 |
Breeding Animals Shorn | 446,837 | 448,106 | 448,106 | 733 |
Market Animals Shorn | 121,925 | 122,273 | 122,273 | 201 |
Total Sheep and Lambs | 574,431 | 574,415 | 574,244 | 104 |
Market Lambs 85 to 105 lbs. | 73,393 | 73,381 | 73,381 | 7 |
Rams for Breeding | 12,215 | 12,214 | 12,214 | 1 |
Market Lambs Over 105 lbs. | 29,419 | 29,419 | 29,419 | 0 |
Market Sheep | 2,618 | 2,618 | 2,618 | 0 |
Ewes for Breeding | 363,222 | 363,222 | 363,222 | 0 |
Table 3. Comparison of Average Expanded Totals for Wyoming Data
Variable | Weight | AGGIES Average | Survey Edited | Percent Difference |
Market Sheep | 3 | 2,618 | 1,739 | 50.55 |
Breeding Animals Shorn | 2 | 447,683 | 443,548 | 0.93 |
Total Sheep and Lambs | 1 | 574,363 | 570,890 | 0.61 |
Ewes for Breeding | 4 | 363,222 | 361,095 | 0.59 |
Wool from Breeding Animals | 1 | 4,200,810 | 4,225,283 | -0.58 |
Rams for Breeding | 5 | 12,214 | 12,153 | 0.50 |
Market Animals Shorn | 2 | 122,157 | 121,809 | 0.29 |
Wool from Market Animals | 1 | 494,146 | 493,623 | 0.11 |
Market Lambs Over 105 lbs. | 2 | 29,419 | 29,411 | 0.03 |
Market Lambs 85 to 105 lbs. | 1 | 73,385 | 73,381 | 0.01 |
Table 4. Accuracy Indices (Manzari and Rocca, 1999)
Assessing ... | Index | Calculation |
Editing Quality | I1: fraction of unmodified data correctly handled | d/(c+d) |
Editing Quality | I2: fraction of modified data correctly handled | a/(a+b) |
Editing Quality | I3: fraction of total data correctly handled | (a+d)/(a+b+c+d) |
Imputation Quality | I4: fraction of changed, unmodified data whose original value is correctly restored | cs/c |
Imputation Quality | I5: fraction of changed, modified data whose original value is correctly restored | as/a |
Imputation Quality | I6: fraction of changed total data whose original value is correctly restored | (as+cs)/(a+c) |
Overall Editing and Imputation Quality | I7: fraction of unmodified data whose original value is correctly restored | (cs+d)/(c+d) |
Overall Editing and Imputation Quality | I8: fraction of modified data whose original value is correctly restored | as/(a+b) |
Overall Editing and Imputation Quality | I9: fraction of total data whose original value is correctly restored | (as+cs+d)/(a+b+c+d) |
Survey Edited Data | AGGIES Edited Data: Changed | AGGIES Edited Data: Not Changed |
Modified | a = as+af | b |
Unmodified | c = cs+cf | d |
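Table 5. Average Value of the Accuracy Indices, Given In Percent
Variable | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 |
Wool from Breeding Animals | 100 | 94 | 99 | 0 | 16 | 16 | 100 | 15 | 86 |
Breeding Animals Shorn | 100 | 91 | 99 | - | 53 | 53 | 100 | 48 | 97 |
Wool from Market Animals | 99 | 77 | 99 | 0 | 13 | 9 | 99 | 10 | 98 |
Ewes for Breeding | 100 | 50 | 99 | - | 100 | 100 | 100 | 50 | 99 |
Market Animals Shorn | 100 | 100 | 100 | - | 47 | 47 | 100 | 47 | 99 |
Rams for Breeding | 100 | 50 | 99 | - | 87 | 87 | 100 | 43 | 99 |
Total Sheep and Lambs | 100 | 75 | 99 | 41 | 91 | 78 | 100 | 59 | 99 |
Market Lambs Over 105 lbs. | 100 | 67 | 100 | - | 100 | 100 | 100 | 67 | 100 |
Market Lambs 85 to 105 lbs. | 100 | 100 | 100 | 0 | 100 | 89 | 100 | 100 | 100 |
Market Sheep | 100 | 40 | 100 | - | 100 | 100 | 100 | 40 | 100 |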