Task 3: How to Calculate Degrees of Freedom for Performing Statistical Tests and Confidence Limits

Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters.  In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is BP_analysis_Data.  Proc descript is being used as a generic example, but these statements apply to all SUDAAN procedures.

 

Step 1: Sorting in SAS

To carry out the appropriate SUDAAN design option for NHANES data, the data from BP_analysis_Data must first be sorted by strata and then by PSU (unless the data have already been sorted by PSU within strata). The proc sort procedure in SAS must precede any SUDAAN statements.
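The sorting step can be sketched as follows. The dataset and variable names are the ones used in this example; the sketch assumes BP_analysis_Data already exists in the WORK library.

```sas
/* Sort by stratum first, then by PSU within stratum.      */
/* This step must run before any SUDAAN procedure is called. */
proc sort data=BP_analysis_Data;
  by sdmvstra sdmvpsu;
run;
```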

 

Warning: Data must always be sorted in SAS before doing analyses in SUDAAN.

 

 

Step 2: Use proc statement in SUDAAN

Generally, a proc statement in SUDAAN immediately follows the proc sort step. In this example, the proc descript statement is used. The data option specifies BP_analysis_Data as the SAS dataset, the design option specifies with-replacement (WR) sampling as the design, and the noprint option suppresses printed results because the results will be written to a SAS data file.

 

Use the DEFT2 option on the proc statement to request calculation of the design effect using SUDAAN Method 2 (see the SUDAAN manual for details), which is the method recommended by NCHS for NHANES data.

 

 

Step 3: Specify design parameters in SUDAAN

The nest statement lists the variables that identify the strata and the PSUs. It is required so that SUDAAN can account for the NHANES sample design. As in the sort step, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).

 

The weight statement accounts for the unequal probability of sampling and nonresponse. For more information on selecting the correct weight, please see Selecting the Correct Weight in the Weighting module.

 

The subpopn statement sets the subgroup. It is recommended that you use the subpopn statement instead of subsetting the data in the data step in SAS. Please see Creating Appropriate Subsets of Data for NHANES Analyses in the Weighting module for more information.

 

The var statement sets the variable of interest. The class statement sets the categorical variables of interest (in SUDAAN, the subgroup and levels statements are an alternative that names the categorical variables and gives the number of levels of each). The tables statement requests output stratified by the categorical variables.
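Put together, the design and analysis statements described in Steps 2 and 3 might look like the sketch below. It uses the class statement and the variable names from this example, and it leaves out the noprint option and the output statement so that results are printed directly. That printing behavior is an assumption about SUDAAN's defaults, so check the SUDAAN manual before relying on it.

```sas
/* Sketch only: assumes BP_analysis_Data is already sorted by      */
/* sdmvstra sdmvpsu and contains hbp (coded 0/100), race and educ. */
proc descript data=BP_analysis_Data design=WR;
  nest sdmvstra sdmvpsu;      /* stratum variable first, then PSU variable */
  weight wtmec4yr;            /* four-year MEC weight                      */
  subpopn riagendr=2 and ridageyr > 19 and ridageyr < 60;  /* women aged 20-59 */
  var hbp;                    /* analysis variable                         */
  class race educ;            /* categorical variables                     */
  tables race*educ;           /* education level within race/ethnic group  */
run;
```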

 

Step 4: Specify output

In this step, you will specify how the results are saved to a file because the output in the proc descript procedure was suppressed using the noprint option. The filetype option determines the type of data file to be produced and the filename option sets the name of the file to which your results will be saved. If you use the replace option, then every time you run the program, your results will be overwritten with the newer results.

 

In SUDAAN, you must specify the ATLEVEL1 and ATLEVEL2 options on the proc statement of proc descript or proc crosstab to request counts of strata and PSUs. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages for which you want counts per table cell (in NHANES, strata are level 1 and PSUs are level 2). The values 1 and 2 are the positions on the nest statement of the variables that designate the stages of sampling. These options correspond to the keywords ATLEV1 and ATLEV2, respectively, on the print or output statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
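As a worked illustration of how these counts are used, the degrees of freedom for a subdomain can be written as below. The symbols are notation introduced here for the ATLEV2 and ATLEV1 counts, and the numbers in the second equation are hypothetical, not NHANES results.

```latex
% n_PSU    = number of PSUs with at least one valid observation (ATLEV2)
% n_strata = number of strata with at least one valid observation (ATLEV1)
\[
\mathrm{df} = n_{\mathrm{PSU}} - n_{\mathrm{strata}}
\]
% Hypothetical subdomain with 30 PSUs and 15 strata:
\[
\mathrm{df} = 30 - 15 = 15
\]
```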

 

The mean and semean options output the mean and the estimated standard error of the mean to the data file. The nsum option outputs the number of observations in each level of each subdomain to the data file. The deffmean option outputs the design effect for each subdomain to the data file.

 

The rformat statement specifies the format for the levels of a categorical variable in the tables statement; a separate rformat statement must be given for each variable. The rtitle statement sets the title of the procedure's printed output. These statements are needed only when printing the results.
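The formats named on the rformat statements (racef. and educf. in this example) must exist before the SUDAAN procedure runs. A minimal sketch of defining them with proc format is shown below. The number of levels and all labels are placeholders assumed for illustration; only the "other" category (race=4) is documented in this example.

```sas
/* Hypothetical format definitions: replace the placeholder labels */
/* and level counts with the actual coding of race and educ.       */
proc format;
  value racef
    1 = 'Race/ethnic group 1 (placeholder)'
    2 = 'Race/ethnic group 2 (placeholder)'
    3 = 'Race/ethnic group 3 (placeholder)'
    4 = 'Other';
  value educf
    1 = 'Education level 1 (placeholder)'
    2 = 'Education level 2 (placeholder)'
    3 = 'Education level 3 (placeholder)';
run;
```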

 

Step 5: Use SAS to calculate degrees of freedom and Wald 95% confidence intervals from SUDAAN output

After outputting the counts of strata and PSUs needed to calculate the degrees of freedom in SUDAAN, you can use SAS to calculate the Wald 95% confidence interval using the correct degrees of freedom for each subdomain.
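In formula form, the Wald limits computed in this step can be sketched as follows, where the symbols (introduced here for illustration) are the estimated percent, its standard error, and percentiles of the t-distribution with the subdomain's degrees of freedom.

```latex
% \hat{p}          = estimated percent (mean) from SUDAAN
% SE(\hat{p})      = its estimated standard error (semean)
% t_{q, df}        = q-th quantile of the t-distribution with df degrees of freedom
\[
\mathrm{LL} = \hat{p} + t_{0.025,\,\mathrm{df}} \cdot \mathrm{SE}(\hat{p}),
\qquad
\mathrm{UL} = \hat{p} + t_{0.975,\,\mathrm{df}} \cdot \mathrm{SE}(\hat{p})
\]
```

Because the t-distribution is symmetric, the lower quantile is the negative of the upper quantile, so this is the familiar form of the estimate plus or minus the critical value times the standard error.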

 

Summary: SUDAAN code to output estimates for calculating 95% confidence limits

The following table shows how to combine the statements described above to properly calculate 95% confidence limits. The procedure proc descript is being used as an example, but the design and nest statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.

 

SUDAAN proc descript procedure

Statements

Explanation

proc sort data=BP_analysis_Data;

by sdmvstra sdmvpsu;

run;

Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN.

proc descript data=BP_analysis_Data design=WR noprint

DEFT2 ATLEVEL1=1 ATLEVEL2=2 ;

Use proc descript to specify the dataset (BP_analysis_Data) and the sample design, using the design option WR (with replacement).

Use the noprint option to suppress printing of results. In this example, you will be sending the results to a SAS data file for further calculations and printing in SAS.

Use the DEFT2 option to request the calculation of the design effect using Method 2 (see the SUDAAN manual for details on the differing methods for calculating the design effect). Method 2 is the method recommended by NCHS for NHANES data.

The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages for which you want counts per table cell (in NHANES, strata are level 1 and PSUs are level 2). ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.

nest sdmvstra sdmvpsu;

Use the nest statement to specify the strata and PSU variables to account for the sample design effects.

Weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and nonresponse. In this example, the MEC weight for four years of data is used.

 SUBPOPN riagendr=2 and ridageyr > 19 and ridageyr < 60 ;

Use a subpopn statement to subset on the subgroup of interest. In this example, it is females (riagendr=2) between ages 20-59 years (ridageyr > 19 and ridageyr < 60).

Var hbp;

Use a var statement to set the variable of interest as the percent with high blood pressure. (In order to generate results that are expressed in percent with the condition, this variable was coded as 0=no high blood pressure and 100=those with high blood pressure.)

class race educ;

Use a class statement to set the categorical variables of interest. In this example, they are race/ethnic group (race) and education level (educ).

Tables race*educ ;

Use a tables statement to request prevalence of high blood pressure stratified on education level (educ) within each race/ethnic group (race).

Output   atlev1=numstrat atlev2=numpsu  mean=mean semean=semean

NSUM=N deffmean /  filetype=SAS filename=test1 replace;

 

Use the output statement to request output of results to a SAS data file (filetype=SAS) called test1 (filename=test1). 

Use the replace option to overwrite this file with the latest results each time the program is run.

Use an atlev1 option to create the SAS data variable, numstrat, with the value obtained from counting the number of strata in each subdomain requested with at least one valid observation.

Use an atlev2 option to create a SAS variable, numpsu, with the value obtained from counting the number of PSUs in each subdomain requested with at least one valid observation.

Use the mean option to output the mean to the SAS data set.  In this example, the mean is the percent of individuals at each level with high blood pressure.

Use the semean option to output the standard error of the mean estimated above to the SAS dataset.

Use the nsum option to create the variable N in the SAS dataset, which gives the number of observations in each level of each subdomain requested in the tables statement.

Use the deffmean option to output the design effect for each subdomain requested.

Rformat race racef.;

Rformat educ educf.;

Use an rformat statement to specify the format for the levels of each categorical variable in the tables statement as needed. A separate rformat statement must be given for each variable. In this example, you are setting the formats for the race/ethnic group (race) and education level (educ) variables.

Rtitle "Prevalence of high blood pressure by race and education level for women age 20-59";

Use the rtitle statement to set the title of the procedure's printed output.

 

Calculate 95% confidence intervals with SAS

Statements

Explanation

Proc sort data=test1;

   by race educ;

run;

 
Use the SAS procedure, proc sort, to sort the data.

DATA test2;  SET test1;

Use the data statement to create a new dataset (test2) and the set statement to read in the data file created in SUDAAN.

  if race=4 then delete;

Use an if-then statement to delete the "other" race category (race=4). This subgroup is not reported in the analysis.

percent=round(mean,.01);

sepercent=round(semean,.01);

Create the variables percent and sepercent and set them equal to a rounded value of the estimates using the round function.

df=numpsu-numstrat;

Calculate the degrees of freedom by subtracting the number of strata (the ATLEV1 count) from the number of PSUs (the ATLEV2 count).

 

tlow=tinv(.025,df);

tup=tinv(.975,df);

Obtain the lower and upper critical values of the t-distribution using the tinv function, which returns the requested percentile (here the 2.5th and 97.5th) of the t-distribution with the given degrees of freedom (df).

rse=round((semean/mean)*100,.01);

rsese=round((1/sqrt(df)),.01);

Calculate the relative standard error and the relative standard error of the standard error. These are useful for determining the reliability of estimates but are not used for confidence limit calculations.

ll=round((mean+tlow*semean),.01);

ul=round((mean+tup*semean),.01);

Calculate the lower (ll) and upper (ul) confidence limits.

proc print;

VAR Race Educ Percent Sepercent df deffmean tup ll ul;

title1 'Degrees of Freedom and Wald 95% Confidence Interval';

title2 'High Blood Pressure by Race and Education, NHANES 1999-2002';

run;

Use the proc print procedure to output the results.

 

Use the var statement to indicate the variables of interest: race (race); education level (educ); percent of people in the category (percent); standard error of the percent (sepercent); degrees of freedom (df); design effect (deffmean); upper t critical value (tup); lower confidence limit (ll); and upper confidence limit (ul).
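To check the calculation without running SUDAAN, the same arithmetic can be tried on a single made-up row. This is a minimal sketch: the dataset name demo, the variable names n_strata and n_psu, and the input values are all hypothetical and stand in for one row of the SUDAAN output file.

```sas
/* Hypothetical row: 20% prevalence, standard error 2,         */
/* 15 strata and 30 PSUs with at least one valid observation.  */
data demo;
  input mean semean n_strata n_psu;
  df   = n_psu - n_strata;                  /* degrees of freedom      */
  tlow = tinv(.025, df);                    /* 2.5th percentile of t   */
  tup  = tinv(.975, df);                    /* 97.5th percentile of t  */
  ll   = round(mean + tlow*semean, .01);    /* lower confidence limit  */
  ul   = round(mean + tup*semean, .01);     /* upper confidence limit  */
  datalines;
20 2 15 30
;
run;

proc print data=demo noobs;
  var mean semean df tlow tup ll ul;
run;
```

With 15 degrees of freedom the upper critical value is about 2.13, so the interval for this made-up row runs from roughly 15.7 to 24.3.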

 
